Data and genotypes

We provide pangenome VCFs for several datasets, built from high-quality, haplotype-resolved assemblies, that can be used directly as input to PanGenie. These files were used to produce the genotyping results for the HGSVC and HPRC projects. Genotypes for 3,202 samples from the 1000 Genomes Project, produced from these VCFs, are also linked below.

Note: results produced by different versions of PanGenie are not directly comparable, since newer versions of PanGenie produce more accurate genotyping results.
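As a quick sanity check after downloading one of the 1000G genotype files, the number of sample columns can be compared against the expected 3,202. The sketch below assumes the genotypes come as a single multi-sample VCF (plain or gzip-compressed). The function name `count_vcf_samples` is illustrative and not part of PanGenie.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of PanGenie): count sample columns in a VCF.
# In a VCF with genotypes, the "#CHROM" header line has 9 fixed columns
# (CHROM through FORMAT) followed by one column per sample.
count_vcf_samples() {
    vcf="$1"
    reader="cat"
    case "$vcf" in
        *.gz) reader="gzip -dc" ;;   # decompress on the fly for .gz files
    esac
    $reader "$vcf" | awk -F'\t' '
        /^#CHROM/ { n = (NF > 9) ? NF - 9 : 0; print n; found = 1; exit }
        END { if (!found) exit 1 }'  # fail if no header line was seen
}

# Run only when a file is passed on the command line.
if [ "$#" -ge 1 ]; then
    count_vcf_samples "$1"
fi
```

Saved as a script and called with the path to a downloaded 1000G-VCF, this should print 3202 if the file contains all samples listed in the tables below.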

PanGenie v1.0.0

| Dataset | PanGenie input VCF | Callset VCF | 1000G Genotypes (n=3,202) |
|---|---|---|---|
| HGSVC-GRCh38 (freeze3, 64 haplotypes) | bubble-VCF | callset-VCF | 1000G-VCF (PanGenie v1.0.0) |
| HGSVC-GRCh38 (freeze4, 64 haplotypes) | bubble-VCF | callset-VCF | 1000G-VCF (PanGenie v1.0.0) |
| HPRC-GRCh38 (88 haplotypes) | bubble-VCF | callset-VCF | 1000G-VCF (PanGenie v1.0.0) |

related publications:

Ebert, P., Audano, P. A., Zhu, Q., Rodriguez-Martin, B., Porubsky, D., Bonder, M. J.,
Sulovari, A., Ebler, J., et al.
Haplotype-resolved diverse human genomes and integrated analysis of structural variation
Science, 372(6537), 2021

Liao, W.-W., Asri, M., Ebler, J., Doerr, D., Haukness, M., Hickey, G., Lu, S., Lucas, J. K.,
Monlong, J., Abel, H. J., et al.
A draft human pangenome reference
Nature, 617(7960), 2023

PanGenie v2.1.1

| Dataset | PanGenie input VCF | Callset VCF | 1000G Genotypes (n=3,202) |
|---|---|---|---|
| HPRC-CHM13 (88 haplotypes) | bubble-VCF | callset-VCF | 1000G-VCF (PanGenie v2.1.1) |

PanGenie v3.1.0

| Dataset | PanGenie input VCF | Callset VCF | 1000G Genotypes (n=3,202) |
|---|---|---|---|
| HGSVC3+HPRC-CHM13 (214 haplotypes) | bubble-VCF | callset-VCF | 1000G-VCF (PanGenie v3.1.0) |

related publication:

Logsdon, G. A., Ebert, P., Audano, P. A., Loftus, M., Porubsky, D., Ebler, J., et al.
Complex genetic variation in nearly complete human genomes
Nature, 644(8076), 2025

PanGenie v4.2.1

| Dataset | PanGenie input VCF | Callset VCF | 1000G Genotypes (n=3,202) |
|---|---|---|---|
| HPRC2-CHM13 (462 haplotypes) | bubble-VCF | callset-VCF | 1000G-VCF (PanGenie v4.2.1) |
| HPRC2-GRCh38 (462 haplotypes) | bubble-VCF | callset-VCF | not available |

In all cases, the bubble-VCF (second column) was given as input to PanGenie, and the callset-VCF (third column) was used to convert the genotyped VCF into a biallelic callset representation. The exact commands are shown below:

PanGenie-index -v <bubble-VCF> -r <reference.fa> -t <number of threads> -o <indexing-prefix>
PanGenie -f <indexing-prefix> -i <reads.fa/fq> -s <sample-name> -j <nr threads kmer-counting> -t <nr threads genotyping> -o <genotyping-prefix>
cat <genotyping-prefix>_genotyping.vcf | python3 convert-to-biallelic.py <callset-VCF> > <genotyping-prefix>_genotyping_biallelic.vcf

The script convert-to-biallelic.py can be found here.
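The three commands above can be chained in a small wrapper. The following is a minimal sketch, not an official PanGenie script. The function names (`run`, `genotype_sample`), the `DRY_RUN` and `THREADS` variables, the `_index` prefix naming, and all file names are illustrative assumptions. Only the flags and output file names shown in the commands above are used. By default the wrapper only prints the commands, so it can be inspected before anything is executed.

```shell
#!/usr/bin/env bash
# Minimal sketch (not part of PanGenie): index, genotype, convert to biallelic.
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 executes them.

# Print a command in dry-run mode, execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Arguments: <bubble-VCF> <callset-VCF> <reference.fa> <reads.fa/fq> <sample-name>
genotype_sample() {
    bubble_vcf="$1"; callset_vcf="$2"; reference="$3"; reads="$4"; sample="$5"
    threads="${THREADS:-8}"            # hypothetical default thread count
    index_prefix="${sample}_index"     # hypothetical naming scheme
    out_prefix="${sample}"

    # Step 1: build the index from the bubble-VCF and the reference.
    run PanGenie-index -v "$bubble_vcf" -r "$reference" -t "$threads" \
        -o "$index_prefix" || return 1

    # Step 2: genotype the sample from its reads.
    run PanGenie -f "$index_prefix" -i "$reads" -s "$sample" \
        -j "$threads" -t "$threads" -o "$out_prefix" || return 1

    # Step 3: convert the genotyped VCF to the biallelic callset representation.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "cat ${out_prefix}_genotyping.vcf | python3 convert-to-biallelic.py ${callset_vcf} > ${out_prefix}_genotyping_biallelic.vcf"
    else
        cat "${out_prefix}_genotyping.vcf" \
            | python3 convert-to-biallelic.py "$callset_vcf" \
            > "${out_prefix}_genotyping_biallelic.vcf"
    fi
}

# Example (dry run, placeholder file names): prints the three commands.
genotype_sample bubble.vcf callset.vcf reference.fa reads.fq sample1
```

Setting `DRY_RUN=0` runs the commands for real. This assumes that `PanGenie-index`, `PanGenie`, and `python3` are on the `PATH` and that `convert-to-biallelic.py` is in the working directory.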