GWAS and Machine Learning

Jun 15, 2026

GWAS and Machine Learning: How Bionomeex Mapped 80 Billion Genetic Interactions

Standard GWAS has a ceiling. After a decade of genome-wide association studies, geneticists can explain only a fraction of the trait variation that genes should theoretically control. The gap has a name, missing heritability, and a cause: GWAS analyzes one genetic variant at a time, while most complex traits emerge from interactions between variants. Mapping those interactions at genome scale was computationally out of reach. Until it wasn't.

In March 2024, Bionomeex published Next-Gen GWAS (NGG) in Genome Biology. The paper describes a machine learning approach that evaluates over 80 billion pairwise SNP interactions within hours, producing the first complete two-dimensional epistatic maps at gene resolution.

The Problem With Standard GWAS

A genome with 350,000 to 400,000 SNPs, modest by current standards, contains roughly 60 to 80 billion possible pairwise interactions. The number of pairwise tests grows quadratically with variant count, and higher-order interactions grow faster still. Computing the full epistatic space for one phenotype using conventional tools would take years. Researchers have worked around this by pre-selecting candidate variants before testing their interactions. The bias is structural: pre-selection can only find what it already suspects.
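The size of the search space is easy to check by hand. The sketch below assumes interactions are counted as unordered pairs, n(n-1)/2; the function name is illustrative and does not come from the paper. Under that convention, about 400,000 SNPs give roughly 80 billion pairs and 350,000 give roughly 61 billion, so the exact total depends on how many variants survive filtering and on the counting convention.

```python
def pairwise_interactions(n_snps: int) -> int:
    """Number of unordered SNP pairs (i < j) among n_snps variants."""
    return n_snps * (n_snps - 1) // 2


for n in (350_000, 400_000):
    print(f"{n:,} SNPs -> {pairwise_interactions(n):,} pairs")
```

Doubling the number of variants roughly quadruples the number of pairs, which is why exhaustive pairwise testing becomes impractical so quickly.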

Current GWAS statistical models require years to assess the entire combinatorial epistatic space for a single phenotype. The most recent developments consist of genetic variable pre-selection or algorithmic acceleration but an attempt to solve large epistatic maps without variable selection was still lacking.

NGG removes the pre-selection requirement. It solves the full interaction space directly.

The Mathematical Breakthrough: Compressed Sensing

The core of NGG is compressed sensing, a signal processing technique that Bionomeex's team (mathematicians Clément Carré and André Mas, and CNRS geneticist Gabriel Krouk) applied to genomics for the first time at this scale.

Compressed sensing allows signal reconstruction from far fewer measurements than classical methods require, provided the signal is sparse. Bionomeex hypothesized that genetic data exhibits this property, which would allow strong compression of the epistatic problem and a substantial acceleration of the computation.

The sparsity assumption holds: for any given phenotype, most of the 80 billion SNP interactions have zero effect. The true epistatic signals are sparse within an enormous combinatorial space. This makes the problem mathematically compressible.
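To make the idea concrete, here is a minimal sketch of sparse recovery on synthetic data. It uses iterative soft-thresholding (ISTA), a generic textbook solver for L1-regularized least squares; it is not NGG's algorithm, and the problem sizes, the regularization weight `lam`, and all variable names are assumptions chosen for illustration.

```python
import numpy as np


def ista(A, y, lam=0.05, n_iter=2000):
    """Iterative soft-thresholding for min 0.5*||A x - y||^2 + lam*||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - step * A.T @ (A @ x - y)        # gradient step on the least-squares term
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return x


# Synthetic problem: 400 unknowns, only 5 nonzero, observed through 120 measurements.
rng = np.random.default_rng(0)
n_measurements, n_unknowns, n_nonzero = 120, 400, 5
A = rng.standard_normal((n_measurements, n_unknowns)) / np.sqrt(n_measurements)
x_true = np.zeros(n_unknowns)
support = rng.choice(n_unknowns, size=n_nonzero, replace=False)
x_true[support] = 2.0 * rng.choice([-1.0, 1.0], size=n_nonzero)
y = A @ x_true

x_hat = ista(A, y)
recovered = {int(i) for i in np.argsort(np.abs(x_hat))[-n_nonzero:]}
print("true support:     ", sorted(int(i) for i in support))
print("recovered support:", sorted(recovered))
```

The system has more unknowns than equations, so ordinary least squares has no unique solution. The sparsity penalty is what makes the nonzero effects identifiable, and that is the property the epistatic problem is assumed to share.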

NGG runs on GPUs. It outperforms permGWAS, the fastest GWAS algorithm at the time of publication, by at least two orders of magnitude. For 2 million SNPs, NGG completes in approximately 5 seconds. permGWAS requires over 460 seconds for the same dataset.

The 80-billion-interaction computation now fits within a few hours.

What NGG Produces

The output is not a ranked list of p-values. It is a complete two-dimensional interaction matrix: every position records the estimated effect of a pairwise SNP combination on a given phenotype.

One 2D-NGG result dataset represents approximately 500 gigabytes of data. To navigate it, Bionomeex developed LuXiol, a visualization tool that functions as a "Google Earth" for full epistatic maps. Results are organized in zoomable layers: a single pixel at the most zoomed-out level reports the maximum intensity of the 4.2 million SNP/SNP interactions in the layers below it.
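The zoomable layers can be pictured as a max-pooling pyramid. The sketch below is an assumption about the general technique, not LuXiol's actual implementation: each overview pixel keeps the largest absolute effect found in the block of interactions beneath it. A 2,048 x 2,048 block would hold 4,194,304 values, which is consistent with the 4.2 million figure, but the block size LuXiol really uses is not stated here. The toy example uses a 64 x 64 map and a hypothetical strong interaction.

```python
import numpy as np


def max_pool(layer: np.ndarray, factor: int) -> np.ndarray:
    """Downsample a square map: each output pixel is the max |effect| of a factor x factor block."""
    n = layer.shape[0] // factor * factor       # drop any ragged edge for simplicity
    blocks = np.abs(layer[:n, :n]).reshape(n // factor, factor, n // factor, factor)
    return blocks.max(axis=(1, 3))


rng = np.random.default_rng(1)
effects = rng.normal(scale=0.01, size=(64, 64))  # mostly negligible effects
effects[37, 5] = 0.9                             # one strong, hypothetical interaction

overview = max_pool(effects, 16)                 # 4 x 4 overview layer
hot = np.unravel_index(overview.argmax(), overview.shape)
print("overview shape:", overview.shape, "hottest tile:", hot)
```

Taking the maximum rather than the mean means a single strong interaction stays visible at every zoom level, so a user can start from the overview and drill down to the SNP pair responsible.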

LuXiol was co-funded by BPI France through an Émergence grant. Without it, the 500 GB output of each NGG run would be scientifically unnavigable.
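The output size follows from the dimensions of the map. As a rough check, assuming the full SNP x SNP matrix is stored densely with one 4-byte value per position (the article does not specify the storage format), 350,000 SNPs come to about 490 GB and 450,000 SNPs to about 810 GB, in line with the half-terabyte and one-terabyte orders of magnitude quoted in this article.

```python
def dense_map_gigabytes(n_snps: int, bytes_per_value: int = 4) -> float:
    """Size in GB (10**9 bytes) of a dense n_snps x n_snps matrix."""
    return n_snps ** 2 * bytes_per_value / 1e9


for n in (350_000, 450_000):
    print(f"{n:,} SNPs -> {dense_map_gigabytes(n):,.0f} GB")
```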

What the Paper Validates

The Genome Biology paper tests NGG on three separate questions.

Does it retrieve known GWAS signals? Applied to 107 Arabidopsis thaliana phenotypes from a landmark dataset, NGG retrieves the same major signals as EMMA, the standard mixed-model GWAS algorithm. For flowering phenotypes, NGG ranks FLOWERING LOCUS C, a well-characterized major gene, as the second strongest signal. EMMA places it 30th.

Does it recover missing heritability? For 16 of 18 phenotypes tested, the 2D epistatic signals recover a substantial share of heritability. For phosphorus content in Arabidopsis leaves, heritability measured by 1D GWAS is around 22%. It increases to 33% when 2D-GWAS information is included.

Does it improve phenotypic prediction? When machine learning models (Deep Neural Networks, Support Vector Machines, Random Forest, and Gradient Boosting) receive epistatic signals from NGG rather than randomly selected variants, 57% show improved predictive performance. With only the top 30 1D signals plus the top 30 2D signals, 80% of models improve.

Where GWAS-2D Runs Today

The paper validates NGG on Arabidopsis thaliana, a model plant with a well-characterized genome. The technology is species-agnostic, and Bionomeex has since deployed it in three separate domains.

Human disease genetics. In collaboration with CHU de Bordeaux and CHU Sainte-Justine in Montreal, NGG runs on complex diseases where individual variant analysis has reached its explanatory ceiling: metabolic disorders, neurological conditions, inherited disease. The target is the epistatic combinations that drive predisposition and progression.

Crop breeding. In collaboration with HZPC, one of the world's leading potato breeding companies, and Michigan State University, GWAS-2D maps genetic combinations linked to drought resistance and yield. Instead of years of phenotypic selection, breeders get a computational map of the trait's genetic architecture.

Forest adaptation. In collaboration with CEFE (Centre d'Écologie Fonctionnelle et Évolutive, Montpellier), NGG runs on beech tree populations to identify which genetic combinations produce climate resilience. Datasets reach 450,000 variants and approximately one terabyte of interaction data per phenotype.

→ 2D-GWAS in Potato Breeding — HZPC
→ 2D-GWAS in Beech Trees — CEFE
→ 2D-GWAS in Complex Disease — CHU Bordeaux

What Changes With NGG and What Stays the Same

NGG works from the same input as standard GWAS: a SNP matrix and measured phenotypic traits. No new experimental protocol. No additional data collection.

The model changes. Instead of testing each variant individually, NGG builds and solves a compressed version of the full pairwise interaction matrix, estimating effect sizes for every SNP and every SNP combination simultaneously, without pre-selection and without multiple testing correction.

The output is different in kind, not just in scale. Researchers get complete epistatic maps rather than ranked variant lists, a view of genetic architecture that variant-by-variant analysis structurally cannot produce. Downstream, this feeds machine learning models for phenotypic prediction, guides breeding programs toward interaction markers rather than individual variants, and opens biological hypotheses that pre-selection-based epistasis research forecloses by design.
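One way to picture the uncompressed model is as a regression whose design matrix contains one column per SNP plus one column per SNP pair. The sketch below builds that matrix explicitly for a toy genotype table. It assumes genotypes coded 0/1/2 and interactions coded as simple products, which is a common convention but not necessarily the encoding NGG uses, and the function name is illustrative.

```python
from itertools import combinations

import numpy as np


def with_pairwise_terms(snps: np.ndarray):
    """Append one product column x_i * x_j for every unordered SNP pair (i < j)."""
    pairs = list(combinations(range(snps.shape[1]), 2))
    interactions = np.column_stack([snps[:, i] * snps[:, j] for i, j in pairs])
    return np.hstack([snps, interactions]), pairs


# Toy genotype matrix: 4 individuals x 3 SNPs, coded as minor-allele counts.
snps = np.array([[0, 1, 2],
                 [1, 1, 0],
                 [2, 0, 1],
                 [0, 2, 2]])
design, pairs = with_pairwise_terms(snps)
print("columns:", design.shape[1], "pairs:", pairs)
```

With three SNPs this adds three columns. At genome scale the explicit matrix would have tens of billions of columns, which is why it has to be compressed rather than materialized.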

Frequently Asked Questions

What is the difference between GWAS and Next-Gen GWAS?
Standard GWAS tests variants one at a time for association with a phenotype. NGG maps pairwise interactions between all variants simultaneously (over 80 billion combinations) without pre-selecting candidates. It recovers a portion of the missing heritability that additive models cannot access, and it improves phenotypic prediction when combined with standard machine learning.

What is compressed sensing and why does it apply to genomics?
Compressed sensing reconstructs sparse signals from far fewer measurements than classical methods require. For a given phenotype, most of the 80 billion SNP interactions have zero effect: the true signals are sparse. Bionomeex exploited this sparsity to compress the epistatic problem mathematically, enabling GPU-accelerated computation at a scale that brute-force approaches cannot reach.

What is LuXiol?
LuXiol is the visualization software Bionomeex developed to navigate GWAS-2D output. Each full epistatic map is approximately 500 gigabytes. LuXiol organizes results in zoomable layers, from a genome-wide overview down to individual SNP pairs, functioning as a geographic information system for the interaction space. It was co-funded by BPI France.

How does NGG handle population structure correction?
NGG does not use a kinship matrix explicitly, but the compressed sensing resolution incorporates a compressed version of the kinship matrix estimation. This appears to correct for population structure about as well as mixed models do, an observation the paper documents but flags as requiring further validation across diverse genetic architectures.

What species can GWAS-2D be applied to?
The technology is species-agnostic. Validated on Arabidopsis thaliana, it has been applied to potato, beech trees, and human disease datasets. The primary requirements are a genotype matrix with sufficient individuals and measured phenotypic traits. Performance scales with individual count: larger populations give more power to detect epistatic signals.

Conclusion

Missing heritability is not a data problem. Sequencing is cheap. The problem is mathematical, the result of a field analyzing genes in isolation when biology runs on interactions. NGG addresses this directly: compressed sensing, GPU acceleration, and a complete epistatic map where there was previously a blank.

The paper ran on Arabidopsis. The technology now runs in two university hospitals, a potato breeding company, and a forest ecology laboratory in Montpellier. If your research has hit the ceiling that standard GWAS puts on complex trait analysis, get in touch.

→ Contact Bionomeex
