name: semtl-sep25 class: title, middle ### Generative and active machine learning for scientific discoveries Alex Hernández-García (he/il/él) .turquoise[[Software Engineering at Montreal](https://semtl.github.io/meeting/2025/08/26/UdeM/) · Université de Montréal · September 22nd 2025] .center[
    
] .center[
    
] .smaller[.footer[ Slides: [alexhernandezgarcia.github.io/slides/{{ name }}](https://alexhernandezgarcia.github.io/slides/{{ name }}) ]] .qrcode[] --- count: false name: title class: title, middle ### Motivation: facilitating scientific discoveries .center[] --- ## Why scientific discoveries? .context[Climate change is a major challenge for humanity.]
.center[
.smaller[Observed (1900–2020) and projected (2021–2100) changes in global surface temperature relative to 1850–1900 (adapted from: IPCC Sixth Assessment Report)]
] .conclusion["The evidence is clear: the time for action is now." .smaller[IPCC Report, 2022]] --- ## Why scientific discoveries? .context[Climate change is a major challenge for humanity.] .center[
.smaller[Climate-sensitive health risks (adapted from: World Health Organization)]
] .smaller[ * Environmental factors take the lives of around 13 million people _per year_. * Climate change affects people’s mental and physical health, access to clean air, safe water, food and health care. ] .full-width[ .conclusion["Climate change is the single biggest health threat facing humanity." .smaller[[WHO and WMO](https://climahealth.info/), 2024]] ] --- ## Why scientific discoveries? ### The potential of materials discovery .context["The time for action is now"] -- > "Limiting global warming will require major transitions in the energy sector. This will involve a substantial reduction in fossil fuel use, widespread electrification, .highlight1[improved energy efficiency, and use of alternative fuels (such as hydrogen)]." .cite[IPCC Sixth Assessment Report, 2022] > "Reducing industry emissions will entail coordinated action throughout value chains to promote all mitigation options, including demand management, .highlight1[energy and materials efficiency, circular material flows]." .cite[IPCC Sixth Assessment Report, 2022] --
.conclusion[Mitigation of the climate crisis requires innovation in the materials sector.] ??? Antimicrobial resistance - https://www.who.int/news-room/fact-sheets/detail/antimicrobial-resistance - https://www.who.int/news-room/feature-stories/detail/donors-making-a-difference--climate-change-and-its-impact-on-health - https://www.who.int/news/item/31-10-2022-who-and-wmo-launch-a-new-knowledge-platform-for-climate-and-health - https://www.who.int/news/item/08-02-2024-who-medically-important-antimicrobial-list-2024 - https://cdn.who.int/media/docs/default-source/gcp/who-mia-list-2024-lv.pdf?sfvrsn=3320dd3d_2 - https://www.who.int/publications/i/item/9789240047655 --- ## Why scientific discoveries? ### The potential of drug discovery .context[Drug discovery and vaccine development play a crucial role in modern healthcare systems.] .right-column-33[ .center[] ] --- count: false ## Why scientific discoveries? ### The potential of drug discovery .context[Drug discovery and vaccine development play a crucial role in modern healthcare systems.] .right-column-33[ .center[] ] .left-column-66[ .highlight1[Bacterial antimicrobial resistance] contributed to 4.95 million deaths in 2019. .cite[World Health Organisation (WHO), 2023] WHO's latest annual review identified 27 antibiotics in clinical development that address WHO bacterial priority pathogens, of which .highlight1[only 6 were classified as innovative]. "The recently approved antibacterial agents are .highlight1[insufficient to tackle the challenge] of increasing emergence and spread of antimicrobial resistance". .cite[World Health Organisation (WHO), 2021] ] --- count: false ## Why scientific discoveries? ### The potential of drug discovery .context[Drug discovery and vaccine development play a crucial role in modern healthcare systems.] .right-column-33[ .center[
"No time to wait". Source:
WHO.
] ] .left-column-66[ .highlight1[Bacterial antimicrobial resistance] contributed to 4.95 million deaths in 2019. .cite[World Health Organisation (WHO), 2023] WHO's latest annual review identified 27 antibiotics in clinical development that address WHO bacterial priority pathogens, of which .highlight1[only 6 were classified as innovative]. "The recently approved antibacterial agents are .highlight1[insufficient to tackle the challenge] of increasing emergence and spread of antimicrobial resistance". .cite[World Health Organisation (WHO), 2021] ] .full-width[ .conclusion["No time to wait". Alongside other necessary actions, drug discovery plays a key role in tackling the antimicrobial resistance global threat.] ] --- ## Machine Learning for Science .center[] .conclusion[Machine learning research has the potential to facilitate scientific discoveries to tackle climate and health challenges.] --- count: false ## Machine Learning for Science and Science for Machine Learning .center[] .conclusion[Machine learning research has the potential to facilitate scientific discoveries to tackle climate and health challenges. Scientific challenges stimulate in turn machine learning research.] --- ## Outline -- - [Introduction: Generative and active learning for scientific discoveries](#mlforscience) -- - [Brief intro to GFlowNets](#gflownets) -- - [Crystal-GFN: materials discovery](#crystal-gfn) -- - [Sampling molecular conformations](#conformers) --- count: false name: mlforscience class: title, middle ### **Generative** and **active** learning for scientific discoveries .center[] --- ## Traditional discovery cycle .context35[The climate crisis demands accelerating scientific discoveries.] -- .right-column-66[
.center[]] .left-column-33[
The .highlight1[traditional pipeline] for scientific discovery: * relies on .highlight1[highly specialised human expertise], * is .highlight1[time-consuming] and * is .highlight1[financially and computationally expensive]. ] --- count: false ## _Active_ machine learning .context35[The traditional scientific discovery loop is too slow for certain applications.] .right-column-66[
.center[]] .left-column-33[
A .highlight1[machine learning model] can be: * trained with data from _real-world_ experiments and ] --- count: false ## _Active_ machine learning .context35[The traditional scientific discovery loop is too slow for certain applications.] .right-column-66[
.center[]] .left-column-33[
A .highlight1[machine learning model] can be: * trained with data from _real-world_ experiments and * used to quickly and cheaply evaluate queries ] --- count: false ## _Active_ machine learning .context35[The traditional scientific discovery loop is too slow for certain applications.] .right-column-66[
.center[]] .left-column-33[
A .highlight1[machine learning model] can be: * trained with data from _real-world_ experiments and * used to quickly and cheaply evaluate queries .conclusion[There are infinitely many conceivable materials and combinatorially many molecules. Are predictive models enough?] ] --- count: false ## Active and _generative_ machine learning .right-column-66[
.center[]] .left-column-33[
.highlight1[Generative machine learning] can: * .highlight1[learn patterns] from the available data, * .highlight1[generalise] to unexplored regions of the search space and * .highlight1[build better queries] ] -- .left-column-33[ .conclusion[Active learning with generative machine learning can in theory more efficiently explore the candidate space.] ] --- count: false name: title class: title, middle ### The challenges of scientific discoveries .center[] .center[] --- ## An intuitive trivial problem .highlight1[Problem]: find one arrangement of Tetris pieces on the board that minimises the empty space. .left-column-33[ .center[] ] .right-column-66[ .center[] ] -- .full-width[.center[
Score: 12
]] --- count: false ## An intuitive ~~trivial~~ easy problem .highlight1[Problem]: find .highlight2[all] the arrangements of Tetris pieces on the board that minimise the empty space. .left-column-33[ .center[] ] .right-column-66[ .center[] ] -- .full-width[.center[
12
12
12
12
12
]] --- count: false ## An intuitive ~~easy~~ hard problem .highlight1[Problem]: find .highlight2[all] the arrangements of Tetris pieces on the board that minimise the empty space. .left-column-33[ .center[] ] .right-column-66[ .center[] ] -- .full-width[.center[
]] --- count: false ## An incredibly ~~intuitive easy~~ hard problem .highlight1[Problem]: find .highlight2[all] the arrangements of Tetris pieces on the board that .highlight2[optimise an unknown function]. .left-column-33[ .center[] ] .right-column-66[ .center[] ] -- .full-width[.center[
]] --- count: false ## An incredibly ~~intuitive easy~~ hard problem .highlight1[Problem]: find .highlight2[all] the arrangements of Tetris pieces on the board that .highlight2[optimise an unknown function]. .left-column-33[ .center[] ] .right-column-66[ .center[] ] .full-width[.conclusion[Materials and drug discovery involve finding candidates with rare properties from combinatorially or infinitely many options.]] --- ## Actual scientific discovery problems .context35[The "Tetris problem" involves .highlight1[sampling from an unknown distribution] in a .highlight1[discrete, high-dimensional, combinatorially large space].] --- count: false ## Actual scientific discovery problems ### Biological sequence design
Proteins, antimicrobial peptides (AMP) and DNA can be represented as sequences of amino acids or nucleobases. There are $22^{100} \approx 10^{134}$ protein sequences with 100 amino acids. .context35[The "Tetris problem" involves sampling from an unknown distribution in a discrete, high-dimensional, combinatorially large space] .center[] -- .left-column-66[ .dnag[`G`].dnaa[`A`].dnag[`G`].dnag[`G`].dnag[`G`].dnac[`C`].dnag[`G`].dnaa[`A`].dnac[`C`].dnag[`G`].dnag[`G`].dnat[`T`].dnaa[`A`].dnac[`C`].dnag[`G`].dnag[`G`].dnaa[`A`].dnag[`G`].dnac[`C`].dnat[`T`].dnac[`C`].dnat[`T`].dnag[`G`].dnac[`C`].dnat[`T`].dnac[`C`].dnac[`C`].dnag[`G`].dnat[`T`].dnat[`T`].dnaa[`A`]
.dnat[`T`].dnac[`C`].dnaa[`A`].dnac[`C`].dnac[`C`].dnat[`T`].dnac[`C`].dnac[`C`].dnac[`C`].dnag[`G`].dnaa[`A`].dnag[`G`].dnac[`C`].dnaa[`A`].dnaa[`A`].dnat[`T`].dnaa[`A`].dnag[`G`].dnat[`T`].dnat[`T`].dnag[`G`].dnat[`T`].dnaa[`A`].dnag[`G`].dnag[`G`].dnac[`C`].dnaa[`A`].dnag[`G`].dnac[`C`].dnag[`G`].dnat[`T`].dnac[`C`].dnac[`C`].dnat[`T`].dnaa[`A`].dnac[`C`].dnac[`C`].dnag[`G`].dnat[`T`].dnat[`T`].dnac[`C`].dnag[`G`]
.dnac[`C`].dnat[`T`].dnaa[`A`].dnac[`C`].dnag[`G`].dnac[`C`].dnag[`G`].dnat[`T`].dnac[`C`].dnat[`T`].dnac[`C`].dnat[`T`].dnat[`T`].dnat[`T`].dnac[`C`].dnag[`G`].dnag[`G`].dnag[`G`].dnag[`G`].dnag[`G`].dnat[`T`].dnat[`T`].dnaa[`A`]
.dnat[`T`].dnat[`T`].dnag[`G`].dnac[`C`].dnaa[`A`].dnag[`G`].dnaa[`A`].dnag[`G`].dnag[`G`].dnat[`T`].dnat[`T`].dnaa[`A`].dnaa[`A`].dnac[`C`].dnag[`G`].dnac[`C`].dnag[`G`].dnac[`C`].dnaa[`A`].dnat[`T`].dnag[`G`].dnac[`C`].dnag[`G`].dnaa[`A`].dnac[`C`].dnat[`T`].dnag[`G`].dnag[`G`].dnag[`G`].dnag[`G`].dnat[`T`].dnat[`T`].dnaa[`A`].dnag[`G`].dnat[`T`].dnaa[`A`].dnag[`G`].dnat[`T`].dnac[`C`].dnag[`G`].dnaa[`A`].dnaa[`A`].dnac[`C`].dnaa[`A`].dnat[`T`].dnaa[`A`].dnat[`T`].dnaa[`A`].dnat[`T`].dnat[`T`].dnag[`G`].dnaa[`A`].dnat[`T`].dnaa[`A`].dnaa[`A`].dnaa[`A`].dnac[`C`].dnaa[`A`]
.dnag[`G`].dnac[`C`].dnat[`T`].dnac[`C`].dnag[`G`].dnac[`C`].dnat[`T`].dnat[`T`].dnaa[`A`].dnag[`G`].dnag[`G`].dnag[`G`].dnac[`C`].dnac[`C`].dnat[`T`].dnac[`C`].dnag[`G`].dnaa[`A`].dnac[`C`].dnat[`T`].dnac[`C`].dnac[`C`].dnat[`T`].dnac[`C`].dnat[`T`].dnag[`G`].dnaa[`A`].dnaa[`A`].dnat[`T`].dnag[`G`].dnag[`G`].dnaa[`A`].dnag[`G`].dnat[`T`].dnag[`G`].dnat[`T`].dnat[`T`].dnac[`C`].dnaa[`A`].dnat[`T`].dnac[`C`].dnag[`G`].dnaa[`A`].dnaa[`A`].dnat[`T`].dnag[`G`].dnag[`G`].dnaa[`A`].dnag[`G`].dnat[`T`].dnag[`G`]
] --- ## Actual scientific discovery problems ### Molecular generation .context35[The "Tetris problem" involves sampling from an unknown distribution in a discrete, high-dimensional, combinatorially large space]
Small molecules can also be represented as sequences or by a combination of higher-level fragments. There may be about $10^{60}$ drug-like molecules. -- .columns-3-left[ .center[  `CC(=O)NCCC1=CNc2c1cc(OC)cc2 CC(=O)NCCc1c[nH]c2ccc(OC)cc12` ]] .columns-3-center[ .center[  `OCCc1c(C)[n+](cs1)Cc2cnc(C)nc2N` ]] .columns-3-right[ .center[  `CN1CCC[C@H]1c2cccnc2` ]] --- ## Machine learning for scientific discoveries ### Summary of main challenges -- .highlight1[Challenge]: very large and high-dimensional search spaces. -- → Need for .highlight2[efficient search and generalisation] of underlying structure. -- .highlight1[Challenge]: highly structured, discrete and continuous objects. -- → Need for .highlight2[generators or samplers designed] for structured data. -- .highlight1[Challenge]: underspecification of objective functions or metrics. -- → Need for .highlight2[diverse candidates]. -- .conclusion[We want to discover diverse high-scoring candidates in very large, structured spaces.] --- ## ML for scientific discoveries ### Available methods .context35[We want to discover diverse high-scoring candidates in very large, structured spaces.] -- .center[What methods can address these challenges?] -- - .highlight1[Reinforcement learning] excels at optimisation in complex spaces, but tends to lack diversity. -- - .highlight1[Traditional sampling methods (MCMC)] provide diversity (sampling), but struggle with mode mixing in high dimensions. -- - .highlight1[Diffusion] excels at learning from data at scale and sampling in continuous spaces, but is limited in leveraging compositional structure. -- .conclusion[Generative flow networks (GFlowNets) combine multiple advantages: **sampling as sequential decision making**.] 
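---
count: false

## ML for scientific discoveries
### How large are these search spaces?

A quick sanity check of the sequence search-space size, as a minimal sketch. The alphabet size (22) and the sequence length (100) come from the protein example; the evaluation rate of one billion candidates per second is a purely hypothetical assumption, chosen only to illustrate the scale.

```python
import math

ALPHABET_SIZE = 22       # amino acids, as in the protein example
SEQUENCE_LENGTH = 100    # residues per sequence
EVALS_PER_SECOND = 1e9   # hypothetical evaluation rate (assumption)
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Work in log10 so that the numbers stay readable.
log10_sequences = SEQUENCE_LENGTH * math.log10(ALPHABET_SIZE)
log10_years = log10_sequences - math.log10(EVALS_PER_SECOND * SECONDS_PER_YEAR)

print(f"22^100 is about 10^{log10_sequences:.1f} sequences")
print(f"Exhaustive enumeration would take about 10^{log10_years:.0f} years")
```

Even under this generous assumption, exhaustive enumeration is out of reach, which is why efficient search and generalisation are needed.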
--- count: false name: gflownets class: title, middle ### A brief intro to GFlowNets .center[] --- ## GFlowNets for science ### 3 key ingredients .context[Materials and drug discovery involve .highlight1[sampling from unknown distributions] in .highlight1[discrete or mixed, high-dimensional, combinatorially large spaces.]] --
1. .highlight1[Diversity] as an explicit objective. -- - Given a score or reward function $R(x)$, learn to _sample proportionally to the reward_. -- 2. .highlight1[Compositionality] in the sample generation. -- - A meaningful decomposition of samples $x$ into multiple sub-states $s_0\rightarrow s_1 \rightarrow \dots \rightarrow x$ can yield generalisable patterns. -- 3. .highlight1[Deep learning] to learn from the generated samples. -- - A machine learning model can learn the transition function $F(s\rightarrow s')$ and generalise the patterns. --- ## 1. Diversity as an objective .context[Many existing approaches treat scientific discovery as an _optimisation_ problem.]
Given a reward or objective function $R(x) \geq 0$, a GFlowNet can be seen as a generative model trained to sample objects $x \in \cal X$ according to .highlight1[a sampling policy $p(x)$ proportional to the reward $R(x)$]: -- .left-column[ $$p(x) = \frac{R(x)}{Z} \propto R(x)$$ ] .right-column[ $$Z = \sum_{x' \in \cal X} R(x')$$ ] -- .full-width[ .center[ ]] --- count: false ## 1. Diversity as an objective .context[Many existing approaches treat scientific discovery as an _optimisation_ problem.]
Given a reward or objective function $R(x) \geq 0$, a GFlowNet can be seen as a generative model trained to sample objects $x \in \cal X$ according to .highlight1[a sampling policy $p(x)$ proportional to the reward $R(x)$]: .left-column[ $$p(x) = \frac{R(x)}{Z} \propto R(x)$$ ] .right-column[ $$Z = \sum_{x' \in \cal X} R(x')$$ ] .full-width[ → Sampling proportionally to the reward function enables finding .highlight1[multiple modes], hence .highlight1[diversity]. .center[] ] --- count: false ## 1. Diversity as an objective .context[Many existing approaches treat scientific discovery as an _optimisation_ problem.]
Given a reward or objective function $R(x) \geq 0$, a GFlowNet can be seen as a generative model trained to sample objects $x \in \cal X$ according to .highlight1[a sampling policy $p(x)$ proportional to the reward $R(x)$]: .left-column[ $$p(x) = \frac{R(x)}{Z} \propto R(x)$$ ] .right-column[ $$Z = \sum_{x' \in \cal X} R(x')$$ ] .full-width[ .conclusion[In GFlowNets, the density $p(x)$ is primarily modelled via a reward function, not via data and $p_{data}(x)$.] ] --- ## 2. Compositionality ### Sample generation process .context35[Sampling _directly_ from a complex, high-dimensional distribution is difficult.] The principle of compositionality is fundamental in semantics, linguistics and mathematical logic, and is thought to be a cornerstone of human reasoning. --- count: false ## 2. Compositionality ### Sample generation process .context35[Sampling _directly_ from a complex, high-dimensional distribution is difficult.] For the Tetris problem, a meaningful decomposition of the samples is .highlight1[adding one piece to the board at a time]. -- .left-column[.center[]] --- count: false ## 2. Compositionality ### Sample generation process .context35[Sampling _directly_ from a complex, high-dimensional distribution is difficult.] For the Tetris problem, a meaningful decomposition of the samples is .highlight1[adding one piece to the board at a time]. .left-column[.center[]] --- count: false ## 2. Compositionality ### Sample generation process .context35[Sampling _directly_ from a complex, high-dimensional distribution is difficult.] For the Tetris problem, a meaningful decomposition of the samples is .highlight1[adding one piece to the board at a time]. .left-column[.center[]] --- count: false ## 2. Compositionality ### Sample generation process .context35[Sampling _directly_ from a complex, high-dimensional distribution is difficult.] For the Tetris problem, a meaningful decomposition of the samples is .highlight1[adding one piece to the board at a time]. .left-column[.center[]] --- count: false ## 2. Compositionality ### Sample generation process .context35[Sampling _directly_ from a complex, high-dimensional distribution is difficult.] For the Tetris problem, a meaningful decomposition of the samples is .highlight1[adding one piece to the board at a time]. 
.left-column[.center[]] --- count: false ## 2. Compositionality ### Sample generation process .context35[Sampling _directly_ from a complex, high-dimensional distribution is difficult.] For the Tetris problem, a meaningful decomposition of the samples is .highlight1[adding one piece to the board at a time]. .left-column[.center[]] --- count: false ## 2. Compositionality ### Sample generation process .context35[Sampling _directly_ from a complex, high-dimensional distribution is difficult.] For the Tetris problem, a meaningful decomposition of the samples is .highlight1[adding one piece to the board at a time]. .left-column[.center[]] --- count: false ## 2. Compositionality ### Sample generation process .context35[Sampling _directly_ from a complex, high-dimensional distribution is difficult.] For the Tetris problem, a meaningful decomposition of the samples is .highlight1[adding one piece to the board at a time]. .left-column[.center[]] .right-column[
.conclusion[The decomposition of the sampling process into meaningful steps yields patterns that may be correlated with the reward function and facilitates learning complex distributions.] ] --- count: false ## 2. Compositionality ### Sample generation process .context35[Sampling _directly_ from a complex, high-dimensional distribution is difficult.] For the Tetris problem, a meaningful decomposition of the samples is .highlight1[adding one piece to the board at a time]. .left-column[.center[]] .right-column[ Objects $x \in \cal X$ are constructed through a sequence of actions from an .highlight1[action space $\cal A$]. ] .right-column[ At each step of the .highlight1[trajectory $\tau=(s_0\rightarrow s_1 \rightarrow \dots \rightarrow s_f)$], we get a partially constructed object $s$ in .highlight1[state space $\cal S$]. ] -- .right-column[ .conclusion[These ideas and terminology are reminiscent of reinforcement learning (RL).] ] --- ## 3. Deep learning policy .context35[GFlowNets learn a sampling policy $p\_{\theta}(x)$ proportional to the reward $R(x)$.] -- .left-column[ .center[] ] --- count: false ## 3. Deep learning policy .context35[GFlowNets learn a sampling policy $p\_{\theta}(x)$ proportional to the reward $R(x)$.] .left-column[ .center[] ] .right-column[
Deep neural networks are trained to learn the transition (flow) policy: $F\_{\theta}(s\_t\rightarrow s\_{t+1})$. ] -- .right-column[ Consistent flow theorem (informal): if, for every state $s$, the sum of the flows into $s$ is equal to the sum of the flows out of $s$, then $p(x) \propto R(x)$. ] .references[ Bengio et al. [Flow network based generative models for non-iterative diverse candidate generation](https://arxiv.org/abs/2106.04399), NeurIPS, 2021. ] -- .right-column[ .conclusion[GFlowNets can be trained with deep learning methods to learn a sampling policy $p\_{\theta}$ proportional to a reward $R(x)$.] ] --- ## Review paper A review of the potential of GFlowNets for AI-driven scientific discoveries. .center[] .references[ Jain et al. [GFlowNets for AI-Driven Scientific Discovery](https://pubs.rsc.org/en/content/articlelanding/2023/dd/d3dd00002h). Digital Discovery, Royal Society of Chemistry, 2023. ] --- ## GFlowNet Python package .right-column-66[Open-source GFlowNet package, developed together with Mila collaborators: Nikita Saxena, Alexandra Volokhova, Michał Koziarski, Divya Sharma, Pierre Luc Carrier, Victor Schmidt, Joseph Viviano. .highlight2[Open source package]: [github.com/alexhernandezgarcia/gflownet](https://github.com/alexhernandezgarcia/gflownet) * A key design principle is the ease of creating new environments and new applications. * Current environments: Tetris, hyper-grid, hyper-cube, hyper-torus, scrabble, crystals, molecules, DNA, decision trees... * Discrete and continuous environments, multiple loss functions, etc. * Visualisation of results on WandB ] .left-column-33[ .center[] ] .qrcode[] --- count: false name: crystal-gfn class: title, middle ## Crystal-GFN: GFlowNets for materials discovery Mila AI4Science: Alex Hernandez-Garcia, Alexandre Duval, Alexandra Volokhova, Yoshua Bengio, Divya Sharma, Pierre Luc Carrier, Yasmine Benabed, Michał Koziarski, Victor Schmidt, Pierre-Paul De Breuck .smaller70[Mila AI4Science et al. 
[Crystal-GFN: sampling crystals with desirable properties and constraints](https://arxiv.org/abs/2310.04925). AI4Mat, NeurIPS 2023 (spotlight).] .center[] --- ## What are crystals? Definition: A crystal or crystalline solid is a solid material whose constituents (such as atoms, molecules, or ions) are arranged in a .highlight1[highly ordered microscopic structure], forming .highlight1[a crystal lattice that extends in all directions]. .left-column[ .center[] ] .right-column[ .center[] ] -- Here, we are concerned mainly with _inorganic crystals_, where the constituents are atoms or ions. -- A crystal structure is characterised by its .highlight1[unit cell], a small imaginary box containing atoms in a specific spatial arrangement with certain symmetry. The unit cell repeats itself periodically in all directions. --- ## Why do we care about crystals? .context35[Materials discovery can help reduce emissions in multiple sectors.] --
Many solid-state materials have crystal structures, and they are a core component of: * Electrocatalysts for fuel cells, hydrogen storage, industrial chemical reactions, carbon capture, etc. * Solid electrolytes for batteries. * Thin film materials for photovoltaics. * ... -- However, .highlight1[material modelling is very challenging]: * Limited data: only about 200,000 inorganic materials are known, while there are infinitely many conceivable materials (for reference: more than a billion molecules are known). * Sparsity: .highlight2[stable materials] only exist in a low-dimensional subspace of all possible 3D arrangements. -- .conclusion[There is a need for efficient generative models of crystal structures.] --- ## A domain-inspired approach ### Crystal structure parameters .context[Most previous works tackle crystal structure generation in the space of atomic coordinates and struggle to preserve the symmetry properties.] --
Instead of optimising the atom positions by learning from a small data set, we draw .highlight1[inspiration from theoretical crystallography to sample crystals in a lower-dimensional space of crystal structure parameters]. -- .highlight2[Space group]: symmetry operations of a repeating pattern in space that leave the pattern unchanged. -- - There are 17 symmetry groups in 2 dimensions (wallpaper groups). - There are 230 space groups in 3 dimensions. --- count: false ## A domain-inspired approach ### Crystal structure parameters .context[Most previous works tackle crystal structure generation in the space of atomic coordinates and struggle to preserve the symmetry properties.]
Instead of optimising the atom positions by learning from a small data set, we draw .highlight1[inspiration from theoretical crystallography to sample crystals in a lower-dimensional space of crystal structure parameters]. .highlight2[Lattice system]: each of the 230 space groups belongs to one of the 7 lattice systems. .center[ 






] --- count: false ## A domain-inspired approach ### Crystal structure parameters .context[Most previous works tackle crystal structure generation in the space of atomic coordinates and struggle to preserve the symmetry properties.]
Instead of optimising the atom positions by learning from a small data set, we draw .highlight1[inspiration from theoretical crystallography to sample crystals in a lower-dimensional space of crystal structure parameters]. .highlight2[Lattice parameters]: The lattice's size and shape is characterised by 6 parameters: .highlight1[$a, b, c, \alpha, \beta, \gamma$]. .center[] --- ## Crystal-GFlowNet ### Sequential generation .center[] --- count: false ## Crystal-GFlowNet ### Sequential generation .center[] --- count: false ## Crystal-GFlowNet ### Sequential generation .center[] --- count: false ## Crystal-GFlowNet ### Sequential generation .center[] --- count: false ## Crystal-GFlowNet ### Sequential generation .center[] --- count: false ## Crystal-GFlowNet ### Sequential generation .center[] --- count: false ## Crystal-GFlowNet ### Sequential generation .center[] --- count: false ## Crystal-GFlowNet ### Sequential generation .center[] --- count: false ## Crystal-GFlowNet ### Sequential generation .center[] --- count: false ## Crystal-GFlowNet ### Sequential generation .center[] .conclusion[Crystal-GFN binds multiple spaces representing crystallographic and material properties, setting intra- and inter-space hard constraints in the generation process.] --- ## GFlowNet approach ### Advantages .context[We generate materials in the lower-dimensional space of crystal structure parameters.] * Constructing materials by their crystal structure parameters allows us to introduce .highlight1[physicochemical and geometric _hard_ constraints]: -- * Charge neutrality of the composition. * Compatibility of composition and space group. * Hierarchical structure of the space group. * Compatibility of lattice parameters and lattice system. -- * .highlight1[Searching in the lower-dimensional space] of crystal structure parameters may be more efficient than in the space of atom coordinates. 
-- * Provided we have access to a predictive model of a material property, we can .highlight1[flexibly generate materials with desirable properties]. -- * We can .highlight1[flexibly sample materials with specific characteristics, such as composition or space group]. -- * Training the generative model does not depend on a data set, but on a proxy model of the property of interest. --- ## Crystal-GFlowNet ### Material properties We can train a Crystal-GFN with any reward function, provided it is computationally tractable. Therefore, we can use it to .highlight1[generate materials with different properties]. -- We have tested the following properties: - .highlight2[Formation energy] per atom [eV/atom], via a pre-trained machine learning model: indicative of the material's stability. -- - .highlight2[Electronic band gap] [eV] (squared distance to a target value, 1.34 eV), via a pre-trained machine learning model: relevant in photovoltaics, for instance. -- - Unit cell .highlight2[density] [g/cm³]: convenient as a proof of concept because we can calculate it _exactly_ from the GFN outputs.
--- count: false ## Crystal-GFlowNet ### Material properties We can train a Crystal-GFN with any reward function, provided it is computationally tractable. Therefore, we can use it to .highlight1[generate materials with different properties]. We have tested the following properties: - .highlight2[Formation energy] per atom [eV/atom], via a pre-trained machine learning model: indicative of the material's stability. - .highlight2[Electronic band gap] [eV] (squared distance to a target value, 1.34 eV), via a pre-trained machine learning model: relevant in photovoltaics, for instance. - .alpha50[Unit cell .highlight2[density] [g/cm³]: convenient as a proof of concept because we can calculate it _exactly_ from the GFN outputs.]
--- ## Results ### Formation energy .context35[The formation energy correlates with stability. The lower, the better.] .center[] --- count: false ## Results ### Formation energy .context35[The formation energy correlates with stability. The lower, the better.] .center[] --- count: false ## Results ### Formation energy .context35[The formation energy correlates with stability. The lower, the better.] .center[] --- count: false ## Results ### Formation energy .context35[The formation energy correlates with stability. The lower, the better.] .center[] --- count: false ## Results ### Formation energy .context[.highlight1[After training, Crystal-GFN samples structures with even lower formation energy [eV/atom] than the validation set.]] .center[] --- ## Results ### Band gap .context35[We aimed to sample structures with a band gap close to 1.34 eV.] .center[] --- count: false ## Results ### Band gap .context35[We aimed to sample structures with a band gap close to 1.34 eV.] .center[] --- count: false ## Results ### Band gap .context35[We aimed to sample structures with a band gap close to 1.34 eV.] .center[] --- count: false ## Results ### Band gap .context35[We aimed to sample structures with a band gap close to 1.34 eV.] .center[] --- count: false ## Results ### Band gap .context[.highlight1[After training, Crystal-GFN samples structures with a band gap [eV] around the target value.]] .center[] --- ## Results ### Diversity .context[.highlight2[Diversity] is key in materials discovery.] Analysis of 10,000 sampled crystals and the top-100 with lowest formation energy. -- - All 10,000 samples are unique. -- - All crystal systems, lattice systems and point symmetries found in the 10,000 samples. - 4 out of 8 crystal-lattice systems in the top-100. - 4 out of the 5 point symmetries in the top-100. -- - All 22 elements found in the 10,000 samples. 
- 15 out of 22 elements in the top-100. -- - 73 out of 113 space groups (65%) found in the 10,000 samples. - 19 out of 113 space groups in the top-100. -- .conclusion[Crystal-GFN samples are highly diverse.] --- count: false name: conformers class: title, middle ## Sampling molecular conformations .highlight2[Alexandra Volokhova], .highlight2[Léna Néhale-Ezzine], Michał Koziarski, Piotr Gaiński, Cheng-Hao Liu, Luca Scimeca, Santiago Miret, Pablo Lemos, Luca Thiede, Zichao Yan, Emmanuel Bengio, Prudencio Tossou, Alán Aspuru-Guzik, Yoshua Bengio .smaller70[Volokhova, Koziarski, et al. [Towards equilibrium molecular conformation generation with GFlowNets](https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00023d). Digital Discovery (2024)] .smaller70[Volokhova, Nehale-Ezzine, et al. [Torsional-GFN: a conditional conformation generator for small molecules](https://arxiv.org/abs/2507.11759). arXiv:2507.11759 (2025)] .center[] --- ## Sampling molecular conformations Sampling diverse, thermodynamically feasible molecular conformations plays a crucial role in predicting properties of a molecule. Goal: given a molecular graph $G$, sample conformations from the Boltzmann distribution, as determined by the molecule's energy. -- .right-column[.center[]] .left-column[ We represent molecules by their intrinsic properties: - Torsion angles - Bond lengths - Bond angles ] --- count: false ## Sampling molecular conformations Sampling diverse, thermodynamically feasible molecular conformations plays a crucial role in predicting properties of a molecule. Goal: given a molecular graph $G$, sample conformations from the Boltzmann distribution, as determined by the molecule's energy. 
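Written out, the goal stated above takes the following form. The symbols other than $G$ are introduced here for illustration and are not taken from the slides: $x$ denotes a conformation, $E(x \mid G)$ its energy and $T$ a temperature expressed in energy units.

$$p(x \mid G) = \frac{\exp\left(-E(x \mid G) / T\right)}{Z(G)}, \qquad Z(G) = \int \exp\left(-E(x' \mid G) / T\right) \, dx'$$

Lower-energy conformations are exponentially more likely under this distribution. The normalising constant $Z(G)$ is generally intractable, so in practice one samples from $p$ rather than evaluating it.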
.right-column[.center[]] .left-column[ We represent molecules by their intrinsic properties: - .highlight1[Torsion angles]: responsible for most of the variance in the conformational space - Bond lengths: considered constant (local structure) - Bond angles: considered constant (local structure) ] --- ## Sampling molecular conformations _with GFlowNets_ GFlowNets are amortised samplers designed to sample proportionally to a reward function. Here, we used the molecular energy as the reward function, computed with different energy estimators: semi-empirical (GFN2-xTB), force field (GFN-FF) and neural network potential (TorchANI). --
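As a rough illustration of what sampling proportionally to an energy-based reward means, the sketch below builds the target distribution by brute force on a grid of two torsion angles. It is not a GFlowNet and not the authors' code: `toy_energy` is a made-up stand-in for a real energy estimator, and the reward $R(x) = \exp(-E(x)/T)$ with a temperature `T` is an assumed, commonly used way of turning an energy into a positive reward.

```python
import numpy as np


def toy_energy(phi1, phi2):
    """Hypothetical energy over two torsion angles (radians); not a real potential."""
    return 1.0 + np.cos(3.0 * phi1) + 0.5 * np.cos(2.0 * phi2) + 0.3 * np.cos(phi1 - phi2)


def boltzmann_reward(energy, temperature=1.0):
    """Assumed reward R(x) = exp(-E(x) / T): lower energy gives higher reward."""
    return np.exp(-energy / temperature)


def target_distribution(n_bins=36, temperature=1.0):
    """Return the grid angles and the normalised reward on an n_bins x n_bins grid."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_bins, endpoint=False)
    phi1, phi2 = np.meshgrid(angles, angles, indexing="ij")
    reward = boltzmann_reward(toy_energy(phi1, phi2), temperature)
    return angles, reward / reward.sum()


def sample_torsions(n_samples, n_bins=36, temperature=1.0, seed=0):
    """Draw pairs of torsion angles with probability proportional to the reward."""
    angles, probs = target_distribution(n_bins, temperature)
    rng = np.random.default_rng(seed)
    flat = rng.choice(probs.size, size=n_samples, p=probs.ravel())
    i, j = np.unravel_index(flat, probs.shape)
    return np.stack([angles[i], angles[j]], axis=1)


if __name__ == "__main__":
    samples = sample_torsions(5)
    print(samples)
```

Enumerating the grid only works for a handful of torsion angles, since the number of cells grows exponentially with their count. A trained GFlowNet is meant to amortise this by learning to sample from the same target without enumeration.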
1. Input: a small molecule represented as a SMILES string. 2. Obtain the molecular graph and _local structure_ from RDKit. 3. Train a GFlowNet to sample adjustments of the torsion angles of the molecular graph. --- ## Results: 2D experiments .center[] .conclusion[The GFlowNet learns to sample almost perfectly in proportion to the energy distribution in molecules with two torsion angles.] --- ## Results: multiple torsion angles .center[] --- count: false ## Results: multiple torsion angles .center[] .conclusion[The GFlowNet can also learn the energy distributions over multiple torsion angles, better than relying solely on RDKit.] --- ## Preliminary conclusions - GFlowNets can sample diverse low-energy conformations of drug-like molecules with multiple torsion angles. - GFlowNets can learn various energy landscapes with different energy estimators. Paper: Volokhova, Koziarski, et al. [Towards equilibrium molecular conformation generation with GFlowNets](https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00023d). Digital Discovery (2024). -- - Since a GFlowNet needs to be trained for each molecule, is this practically useful at all? - Is MCMC not as good as GFlowNets for this task? .conclusion[Besides amortising each energy evaluation for a given molecule, GFlowNets can potentially _generalise to unseen molecules_.] --- ## Torsional-GFN ### Conditional conformation generation common to all molecules
1. Input: a data set of small molecules represented as SMILES strings. 2. Obtain the molecular graph and _local structure_ from molecular dynamics. 3. Train a GFlowNet to sample adjustments of the torsion angles of the molecular graph. 4. At sampling time, any SMILES string (even from unseen molecules) can be passed as an input. --- ## Results - We trained Torsional-GFN on a subset of 6 molecules, each with 2 rotatable torsion angles, from the FreeSolv data set. - Local structures were fixed to the values of one arbitrary conformation from the MD simulation data set. - Energy function: MMFF94. -- .left-column[ .center[
Local structure in the training set.
]] .right-column[ .center[
Local structure unseen during training.
]] --- ## Results ### Energy histograms .center[] --- ## Conclusions - Generative and active machine learning can contribute to scientific discoveries. - Generative flow networks (GFlowNets) address some of the challenges faced by generative machine learning for scientific exploration. - Two examples are Crystal-GFN, to design crystal structures for materials discovery, and Torsional-GFN, to sample molecular conformations from the Boltzmann distribution. - All this requires large doses of software engineering! .references[ - Jain et al. [GFlowNets for AI-Driven Scientific Discovery](https://pubs.rsc.org/en/content/articlelanding/2023/dd/d3dd00002h). Digital Discovery, Royal Society of Chemistry, 2023. - Mila AI4Science et al. [Crystal-GFN: sampling crystals with desirable properties and constraints](https://arxiv.org/abs/2310.04925). AI4Mat, NeurIPS 2023 (spotlight). - Volokhova, Koziarski, et al. [Towards equilibrium molecular conformation generation with GFlowNets](https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00023d). Digital Discovery (2024). - Volokhova, Nehale-Ezzine, et al. [Torsional-GFN: a conditional conformation generator for small molecules](https://arxiv.org/abs/2507.11759). arXiv:2507.11759 (2025). ] --- ## Acknowledgements .columns-3-left[ Céline Roget
Divya Sharma
Lena Podina
Pierre-Louis Lemaire
] .columns-3-center[ Dounia Shaaban Kabakibo
Shahana Chatterjee
Leah Wairimu
Ameer Nizami
] .columns-3-right[ Hyeonah Kim
Felix Therrien
Om Patel
Jacopo Ghirri
] .full-width[Crystal-GFN: Alexandre Duval, Alexandra Volokhova, Yoshua Bengio, Divya Sharma, Pierre Luc Carrier, Yasmine Benabed, Michał Koziarski, Victor Schmidt, Pierre-Paul De Breuck] .full-width[Molecular conformations: Alexandra Volokhova, Léna Néhale-Ezzine, Michał Koziarski, Piotr Gaiński, Cheng-Hao Liu, Luca Scimeca, Santiago Miret, Pablo Lemos, Luca Thiede, Zichao Yan, Emmanuel Bengio, Prudencio Tossou, Alán Aspuru-Guzik, Yoshua Bengio] .full-width[.conclusion[Science is a lot more fun when shared with bright and interesting people!]] --- name: semtl-sep25 class: title, middle  Alex Hernández-García (he/il/él) .center[
    
    
    
] .footer[[alexhernandezgarcia.github.io](https://alexhernandezgarcia.github.io/) | [alex.hernandez-garcia@mila.quebec](mailto:alex.hernandez-garcia@mila.quebec) | [alexhergar.bsky.social](https://bsky.app/profile/alexhergar.bsky.social)]
.smaller[.footer[ Slides: [alexhernandezgarcia.github.io/slides/{{ name }}](https://alexhernandezgarcia.github.io/slides/{{ name }}) ]]