GI-DB Documentation

This page describes the Genome India Database (GI-DB): the underlying project, data production and pipeline, variant annotation, population structure, and how to use the browser and API.

About GI-DB

GI-DB (Genome India Database) is a public resource that aggregates and serves allele frequencies and annotations for genetic variants from the Genome India Project. It provides a searchable catalogue of variants across the Indian population, with population-specific and overall frequencies, functional annotations, and links to external databases. The resource is intended for researchers and clinicians interested in population genetics, rare variant interpretation, and precision medicine in the Indian context.

The Genome India Project

The Genome India Project is a national initiative funded by the Department of Biotechnology (DBT), Government of India, launched in January 2020. Its goal is to sequence genomes from healthy Indian individuals representing diverse population groups across the country.

Sample design

Target cohort: Healthy individuals from multiple states, language families, and ethnic groups, including tribal and caste populations.
Geographic coverage: 22 states; 15 major language families.
Scale: Thousands of whole-genome samples contribute to the variant catalogue (about 9,768 samples in the current release).

This design ensures the database captures genetic diversity representative of India and supports the identification of population-specific and rare variants.

Data production and pipeline

GI-DB follows best practices for variant calling and quality control, analogous to those used in large-scale resources such as gnomAD.

Reference genome and pipeline

All data are aligned and called against the GRCh38/hg38 reference genome. Processing is performed using the DRAGEN (Dynamic Read Analysis for GENomics) pipeline for alignment, duplicate marking, and variant calling. Both single-nucleotide variants (SNVs) and short indels are included, with allele counts and frequencies computed across the full cohort.

Quality control

Sample QC: Samples are filtered using kinship analysis to exclude related individuals, along with metrics such as call rate and contamination, so that the final cohort is high quality and suitable for frequency estimation.
Variant QC: Variants may be filtered by depth, genotype quality, and call rate so that frequency estimates are reliable.
Annotation: Variants are annotated for functional consequence, population frequency, and other fields used in the browser and API.
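
As a rough illustration of the variant QC step, the sketch below applies depth, genotype-quality, and call-rate thresholds to per-variant summary metrics. The field names (mean_depth, mean_gq, call_rate), the threshold values, and the example variants are all assumptions made for illustration; they are not the metrics or cut-offs used by the GI-DB pipeline.

```python
from dataclasses import dataclass


@dataclass
class VariantMetrics:
    variant_id: str
    mean_depth: float  # mean read depth across samples
    mean_gq: float     # mean genotype quality across samples
    call_rate: float   # fraction of samples with a called genotype


# Hypothetical thresholds, chosen only for this example.
MIN_DEPTH = 10.0
MIN_GQ = 20.0
MIN_CALL_RATE = 0.95


def passes_variant_qc(m: VariantMetrics) -> bool:
    """Return True if the variant meets all three example thresholds."""
    return (
        m.mean_depth >= MIN_DEPTH
        and m.mean_gq >= MIN_GQ
        and m.call_rate >= MIN_CALL_RATE
    )


# Made-up variants: one passes, the others each fail a single criterion.
variants = [
    VariantMetrics("chr1:1000-A-G", 32.5, 88.0, 0.99),
    VariantMetrics("chr1:2000-C-T", 6.0, 45.0, 0.98),   # low depth
    VariantMetrics("chr2:3000-G-A", 28.0, 15.0, 0.97),  # low genotype quality
    VariantMetrics("chr2:4000-T-C", 30.0, 60.0, 0.80),  # low call rate
]

kept = [m.variant_id for m in variants if passes_variant_qc(m)]
print(kept)
```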

Variant annotation

Each variant in GI-DB is annotated to support interpretation and filtering.

Identifiers and location

Chromosome, position, reference, alternate: Genomic coordinate and alleles (e.g. chr, pos, ref, alt).
Variant ID: Internal identifier (e.g. GIDB_ID) and, where available, dbSNP rsID.

Functional annotation

Variants are annotated with predicted functional consequence (e.g. synonymous, missense, loss-of-function) and gene/transcript context. Consequence types in the database include, among others:

Intergenic and intron variants
Upstream/downstream gene variants
Synonymous and missense variants
3′ and 5′ UTR variants
Splice region, stop-gained, and frameshift variants

Frequency and counts

For each variant, the database stores:

Allele count (AC) / allele number (AN): Count of alternate alleles and total alleles in the cohort.
Allele frequency (AF): Proportion of alternate alleles (overall and optionally by population).
Sample counts: Number of samples with the variant (e.g. NS) and homozygote counts where applicable.

These fields are shown on the variant pages and are available via the API for gene and region queries.
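
The relationship between these fields can be sketched with a small calculation. The example assumes a biallelic variant on an autosome with diploid genotypes, so each called sample contributes two alleles; the function and key names are illustrative and are not GI-DB field names.

```python
def allele_stats(n_het: int, n_hom_alt: int, n_called: int) -> dict:
    """Derive AC, AN and AF from genotype counts (diploid, biallelic)."""
    if n_het < 0 or n_hom_alt < 0 or n_het + n_hom_alt > n_called:
        raise ValueError("carrier counts must be non-negative and <= n_called")
    ac = n_het + 2 * n_hom_alt       # alternate alleles observed
    an = 2 * n_called                # total alleles among called genotypes
    af = ac / an if an else None     # undefined when nothing was called
    return {
        "AC": ac,
        "AN": an,
        "AF": af,
        "carriers": n_het + n_hom_alt,  # samples carrying the alternate allele
        "homozygotes": n_hom_alt,
    }


# 30 heterozygotes and 5 alternate homozygotes among 1,000 called samples.
stats = allele_stats(n_het=30, n_hom_alt=5, n_called=1000)
print(stats)
```

Computing AN from called genotypes only, rather than from the full cohort size, keeps AF from being biased downward at sites where some samples have missing calls.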

Population structure

The cohort is structured into 83 population groups, reflecting geographic, linguistic, and ethnic diversity in India. Frequencies can be aggregated overall or by population, enabling:

Discovery of population-specific variants
More accurate assessment of rare variants in specific groups
Research on population structure and admixture within India
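
A minimal sketch of aggregating frequencies overall and by population is shown below. It assumes per-population allele counts (AC) and allele numbers (AN) are available; the population labels and counts are invented for the example. The overall frequency is computed from pooled counts rather than by averaging per-population frequencies, which would over-weight small groups.

```python
def population_frequencies(counts: dict) -> tuple:
    """counts maps population label -> (AC, AN).

    Returns (per-population AF dict, overall AF from pooled counts).
    """
    per_pop = {
        pop: (ac / an if an else None)
        for pop, (ac, an) in counts.items()
    }
    total_ac = sum(ac for ac, _ in counts.values())
    total_an = sum(an for _, an in counts.values())
    overall = total_ac / total_an if total_an else None
    return per_pop, overall


# Hypothetical labels and counts: a variant common in one group, rare elsewhere.
counts = {"POP_A": (12, 200), "POP_B": (0, 400), "POP_C": (1, 400)}
per_pop, overall = population_frequencies(counts)

for pop, af in per_pop.items():
    print(pop, af)
print("overall", overall)
```

In this made-up case the variant looks rare overall (13 of 1,000 alleles) while reaching 6% in POP_A, which is the kind of population-specific signal the per-group frequencies are meant to expose.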

Population labels and sample sizes are described in the project publications and may be summarized in the browser or in downloadable metadata.

Database statistics

Summary statistics for the current release are available on the Stats page and give an overview of the scale of the resource.

Variant counts

Total variants: Approximately 130 million variants in the full catalogue.
Chromosome distribution: Variant counts per chromosome (chr1–chr22) are shown in the Stats charts.

Functional distribution

Approximate distributions of variant consequences (e.g. intergenic, intronic, missense, synonymous) and of exonic variant types (e.g. synonymous, missense, nonsense, frameshift) are provided to illustrate the composition of the dataset.

Exact numbers may be updated with new releases; refer to the Stats page and publication for current figures.

Using the browser

The GI-DB website allows you to search and explore variants by gene, variant, region, or rsID.

Search types

Query type	Example	Description
Gene	`BRCA1`	Returns variants overlapping the gene (by symbol).
Variant	`chr7:117504290-C-T`	Exact variant by chromosome, position, ref, alt.
Region	`chr22:23727262-23777262`	All variants in the given genomic interval.
rsID	`rs1000000`	Variant by dbSNP identifier.
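
The four query formats in the table can be told apart with simple pattern matching. The sketch below is one way a client script might classify and parse a query string before searching; the regular expressions are inferred from the examples in the table and are assumptions, not the browser's actual validation rules.

```python
import re

# Chromosome names as they appear in the examples (chr1-chr22, plus X, Y, M).
CHROM = r"chr(?:[1-9]|1\d|2[0-2]|X|Y|M)"
RSID_RE = re.compile(r"^rs\d+$")
VARIANT_RE = re.compile(rf"^({CHROM}):(\d+)-([ACGT]+)-([ACGT]+)$")
REGION_RE = re.compile(rf"^({CHROM}):(\d+)-(\d+)$")
GENE_RE = re.compile(r"^[A-Za-z][A-Za-z0-9.-]*$")


def classify_query(query: str) -> tuple:
    """Return (query_type, parsed_fields) for a search string."""
    q = query.strip()
    # rsIDs are checked first because they would also match the gene pattern.
    if RSID_RE.match(q):
        return "rsid", {"rsid": q}
    m = VARIANT_RE.match(q)
    if m:
        chrom, pos, ref, alt = m.groups()
        return "variant", {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt}
    m = REGION_RE.match(q)
    if m:
        chrom, start, end = m.group(1), int(m.group(2)), int(m.group(3))
        if start > end:
            raise ValueError(f"region start is after end: {q!r}")
        return "region", {"chrom": chrom, "start": start, "end": end}
    if GENE_RE.match(q):
        return "gene", {"symbol": q}
    raise ValueError(f"unrecognised query: {q!r}")


for example in ["BRCA1", "chr7:117504290-C-T", "chr22:23727262-23777262", "rs1000000"]:
    print(example, "->", classify_query(example))
```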

Variant page

After searching, you can open a variant to see:

Genomic position, alleles, and identifiers
Allele frequency in Genome India and, where available, global frequency
Functional annotation and gene context
Links to external resources (e.g. dbSNP, gnomAD, ClinVar, Ensembl, UCSC)

Data access and citation

API

Programmatic access is provided via the GI-DB API. Supported query types include:

Gene: Fetch variants by gene symbol(s).
Location: Fetch variants by genomic region(s).

Responses include variant identifiers, coordinates, allele frequencies, and annotations. See the API documentation for endpoints, parameters, and rate limits.
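
As a sketch of how a client might prepare gene and location queries, the helper below builds request URLs without sending them. The base URL, the path segments, and the parameter name q are placeholders invented for this example; the real endpoints and parameters are those given in the API documentation and may differ.

```python
from urllib.parse import urlencode

# Placeholder base URL, not the real GI-DB API address.
API_BASE = "https://example.org/gidb-api"

# Hypothetical path segment for each supported query type.
QUERY_PATHS = {"gene": "gene", "location": "location"}


def build_query_url(query_type: str, values: list) -> str:
    """Build a URL for a gene or location query (names are illustrative)."""
    if query_type not in QUERY_PATHS:
        raise ValueError(f"unsupported query type: {query_type!r}")
    if not values:
        raise ValueError("at least one gene symbol or region is required")
    # urlencode percent-encodes characters such as ':' and ',' in the values.
    params = urlencode({"q": ",".join(values)})
    return f"{API_BASE}/{QUERY_PATHS[query_type]}?{params}"


gene_url = build_query_url("gene", ["BRCA1", "TP53"])
region_url = build_query_url("location", ["chr22:23727262-23777262"])
print(gene_url)
print(region_url)
```

Keeping URL construction separate from the network call makes it easier to add throttling later, which matters if the documented rate limits are strict.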

Citation

The flagship manuscript describing the Genome India cohort, pipeline, and variant catalogue will be published soon. Once available, please cite that publication when using GI-DB or Genome India data.

For the database and web resource, please acknowledge: GI-DB – Genome India Database (https://gidb.igib.res.in), maintained by CSIR-IGIB.