Curated Datasets Useful for Building Commons Management Tools

GIS-based planning for natural resource management

GIS-based planning refers to planning the placement of natural resource management (NRM) assets within a geographical region using GIS tools such as Quantum GIS (QGIS) and Google Earth Engine (GEE). The process facilitates the creation of new assets in a scientific and equitable manner and can also be used to diagnose the failure of existing assets. The datasets listed below can be used for GIS-based planning for a region by overlaying the corresponding layers in a GIS tool.

For scientific planning of assets within an administrative boundary, the boundary can be imported as a layer, along with layers such as drainage lines, slope, elevation, lineaments, soil type, and precipitation. The location of the proposed assets can then be marked, and a feasibility assessment of the proposed assets can be done. The same process can be followed to diagnose the failure of an existing asset (e.g., detecting the presence of a lineament below a water structure with low water retention). From the perspective of equitable planning, the layers corresponding to existing NRM assets and socio-economic development can be imported. The location of proposed assets can then be decided with the objective of ensuring equitable access to the assets for the different socio-economic groups.

Guide to import these datasets into QGIS or GEE:

Open Datasets

Admin Boundaries

Administrative boundaries determine jurisdiction for policies and legislation. They also determine control over resources, activities, and development. We have compiled a dataset of Indian administrative boundaries at the state, district, gram panchayat, block, village, and agro-climatic zone levels.

Some administrative boundaries change between census years because states, districts, villages, and other units are created or merged over time. We have mapped these changes between the 2001 and 2011 censuses.

Land Use and Land Cover

Land use and land cover classification is the categorization of human activities and natural elements on the landscape within a specific time frame based on scientific and statistical methods of analysis of remote sensing data. Our goal is to provide reproducible classification outputs using open-source satellite data and publicly available code that can be executed on commodity platforms like Google Earth Engine.

Dataset documentation

Coming Soon


NREGA geotagged assets data: ISRO has undertaken the task of geotagging the NREGA works. The data has been made publicly available through the Bhuvan portal. It provides the basic details of works, such as work code and asset ID, but does not provide other essential metadata such as work type, expenditure, and the number of person-days needed to accomplish the work. This metadata has been obtained from the NREGA MIS website, which is managed by the Ministry of Rural Development, Government of India.

NREGA panchayat dataset: Panchayat-level NREGA metadata, such as the number of job cards issued, the number of active job cards, and the number of person-days of employment provided, has been crawled from the same NREGA MIS website for each financial year from 2014-15 to 2020-21.

Hydrological Variables

Water balance is a method in hydrology for estimating unknown water fluxes. It is expressed as an equation relating the water inputs, outputs, and storage in a watershed. We intend to estimate the net change in groundwater on a fortnightly basis for a micro watershed by solving the water balance equation, which takes precipitation, runoff, evapotranspiration, and change in soil moisture as inputs and yields the change in groundwater. Each of these inputs is derived from remote sensing products in order to diagnose the groundwater state of a micro watershed. The groundwater state can be Safe, Semi-critical, Critical, or Over Exploited, as categorized by the Central Ground Water Board (CGWB) of the Government of India. With the current groundwater state in hand, the objective is to improve it through the following interventions:

  • Change in cropping patterns
  • Construction of rainwater harvesting structures
  • Building plantations

The interventions affect different components of the water balance equation and allow us to project the groundwater state in the future. For example, a change in cropping patterns affects the evapotranspiration term of the water balance equation. Groundwater projections will facilitate scientific and participatory planning within the community.
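The water balance described above can be sketched as a small function. This is a minimal illustration, assuming the common form of the equation in which precipitation is partitioned into runoff, evapotranspiration, change in soil moisture, and change in groundwater. The function name, the units (mm per fortnight), and the sample values are hypothetical and are not taken from the datasets.

```python
def groundwater_change(precipitation, runoff, evapotranspiration, soil_moisture_change):
    """Return the net change in groundwater (mm) for one fortnight.

    Assumed form: dG = P - Q - ET - dSM, with every term in mm over the
    same micro watershed and the same fortnight.
    """
    return precipitation - runoff - evapotranspiration - soil_moisture_change


# Hypothetical fortnight values in mm (illustrative only).
baseline = groundwater_change(
    precipitation=120.0,
    runoff=35.0,
    evapotranspiration=60.0,
    soil_moisture_change=10.0,
)

# A change in cropping pattern is modelled here as lower evapotranspiration;
# all other terms are held fixed for the comparison.
with_new_crops = groundwater_change(
    precipitation=120.0,
    runoff=35.0,
    evapotranspiration=50.0,
    soil_moisture_change=10.0,
)

print(baseline, with_new_crops)
```

Holding the other terms fixed isolates the effect of one intervention on the groundwater term, which is the kind of comparison a projection would repeat for each fortnight.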

Dataset Documentation

Precipitation: GSMaP Operational: Global Satellite Mapping of Precipitation (Earth Engine Data Catalog)

Evapotranspiration: GES DISC Dataset: FLDAS Noah Land Surface Model L4 Central Asia Daily 0.01 x 0.01 degree (FLDAS_NOAH001_G_CA_D 001)

Soil moisture: SMAP/Sentinel-1 L2 Radiometer/Radar 30-Second Scene 3 km EASE-Grid Soil Moisture, Version 3 (National Snow and Ice Data Center)

Hydrologic soil groups: Global Hydrologic Soil Groups (HYSOGs250m) for Curve Number-Based Runoff Modeling

Land use land cover classification: Bansal, C., Ahlawat, H. O., Jain, M., Prakash, O., Mehta, S. A., Singh, D., … & Seth, A. (2021, June). IndiaSat: A Pixel-Level Dataset for Land-Cover Classification on Three Satellite Systems-Landsat-7, Landsat-8, and Sentinel-2. In ACM SIGCAS Conference on Computing and Sustainable Societies (pp. 147-155).


The above-mentioned datasets can be provided at the block level upon request.

Socio-Economic Development

The Human Development Index (HDI) builds an aggregate index for development by giving equal weightage to indicators for economic development (per capita GDP), education (literacy rate), and health (life expectancy). We similarly build an aggregate development index (ADI) from different socio-economic indicators that we predict using remote-sensing data. This index can be computed at a higher spatio-temporal resolution and can depict the development status of a district or village, which allows targeted policy-making initiatives.
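The idea of an equal-weight aggregate index can be sketched as follows. This is only an illustration, assuming min-max normalization followed by an arithmetic mean. The indicator names, bounds, and values are hypothetical, and the actual indicators and aggregation used for the ADI may differ.

```python
def min_max_normalize(value, lowest, highest):
    """Scale a raw indicator to [0, 1] using assumed lower and upper bounds."""
    return (value - lowest) / (highest - lowest)


def aggregate_index(normalized_indicators):
    """Equal-weight arithmetic mean of indicators already scaled to [0, 1]."""
    return sum(normalized_indicators) / len(normalized_indicators)


# Hypothetical village: names, raw values, and bounds are illustrative only.
indicators = {
    "literacy_rate_percent": min_max_normalize(75.0, 0.0, 100.0),
    "electrified_households_percent": min_max_normalize(50.0, 0.0, 100.0),
    "pucca_houses_percent": min_max_normalize(25.0, 0.0, 100.0),
}

adi = aggregate_index(list(indicators.values()))
print(round(adi, 3))
```

Normalizing each indicator before averaging keeps an indicator with a large numeric range from dominating the index.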

Dataset Documentation

In the CSV file linked below, the columns ADI_2001, ADI_2011, and ADI_2019 hold the ADI value at the village level. A blank entry in these columns means that the value could not be computed because the shapefile or remote sensing data was missing.
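A minimal sketch of reading such a file while treating blank ADI cells as missing values is given below. Only the ADI_2001, ADI_2011, and ADI_2019 column names come from the description above. The village_id column and the sample rows are hypothetical stand-ins for the real file.

```python
import csv
import io

ADI_COLUMNS = ["ADI_2001", "ADI_2011", "ADI_2019"]

# Hypothetical rows standing in for the real CSV file.
sample_csv = io.StringIO(
    "village_id,ADI_2001,ADI_2011,ADI_2019\n"
    "V001,0.31,0.42,0.55\n"
    "V002,,0.38,0.47\n"
    "V003,0.29,,\n"
)


def parse_adi(cell):
    """Return the ADI as a float, or None when the cell is blank."""
    cell = cell.strip()
    return float(cell) if cell else None


rows = []
for record in csv.DictReader(sample_csv):
    row = {"village_id": record["village_id"]}
    for column in ADI_COLUMNS:
        row[column] = parse_adi(record[column])
    rows.append(row)

# Villages with an ADI value for all three years.
complete = [
    row["village_id"]
    for row in rows
    if all(row[column] is not None for column in ADI_COLUMNS)
]
print(complete)
```

Mapping blanks to None rather than 0 keeps villages with missing data from being read as having the lowest possible development score.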

Closed Datasets

These datasets are community-generated and not openly available. Please contact us for access.

Question-Answering Dataset

The dataset comprises question-answer pairs, in which questions sharing the same answer are grouped to represent various ways of asking the same question. It is categorized into broad themes, enabling more efficient answer retrieval when theme information is available.

Important Columns:

  • Caller Query: This is the link to the audio file for the question that is asked by the user.
  • Caller Query Transcription: This is the text version of the original caller query after transcription.
  • Relevant Question: Many callers include personal information and other irrelevant parts in their question. All additional information is removed from these questions to create the relevant question. This is the column on which the model is trained and tested.
  • Sanitized Question: A refined version of the question is manually created so that different relevant questions can be mapped to the corresponding sanitized questions.
  • Theme and Sub-Theme: All caller query transcriptions are tagged with a corresponding theme and sub-theme by the caller.
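The grouping described by these columns can be sketched as below. The column names follow the list above, but the records and the helper function are hypothetical, and the real dataset may be organized differently.

```python
from collections import defaultdict

# Hypothetical records; only the column names follow the dataset description.
records = [
    {
        "Relevant Question": "How do I get a job card?",
        "Sanitized Question": "How to apply for a job card?",
        "Theme": "NREGA",
    },
    {
        "Relevant Question": "Where can I apply for a job card?",
        "Sanitized Question": "How to apply for a job card?",
        "Theme": "NREGA",
    },
    {
        "Relevant Question": "When should I sow wheat?",
        "Sanitized Question": "What is the sowing time for wheat?",
        "Theme": "Agriculture",
    },
]


def group_by_sanitized(records, theme=None):
    """Map each sanitized question to the relevant questions that point to it.

    When a theme is given, only records tagged with that theme are kept,
    which narrows the set of candidates to search.
    """
    groups = defaultdict(list)
    for record in records:
        if theme is not None and record["Theme"] != theme:
            continue
        groups[record["Sanitized Question"]].append(record["Relevant Question"])
    return dict(groups)


all_groups = group_by_sanitized(records)
nrega_groups = group_by_sanitized(records, theme="NREGA")
print(len(all_groups), len(nrega_groups))
```

Filtering by theme before matching reduces the number of candidate questions, which is how theme information can make answer retrieval more efficient.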

Purpose of the data collection:

A challenge on Automatic Speech Recognition (ASR) for Hindi was organized as part of INTERSPEECH 2022 by sharing the spontaneous telephone speech recordings collected by Gram Vaani. The regional variations of Hindi, together with the spontaneity of speech, natural background, and transcriptions with varying degrees of accuracy due to crowdsourcing, make it a unique corpus for automatic recognition of spontaneous telephone speech.

Recent advancements in speech technology have shown that ASR systems can work on par with humans. Building a good ASR system requires large amounts of training data and high-end computational resources. However, when it comes to Indian languages, not everyone, especially academic institutions and startups, has access to these resources. As a part of this challenge, telephone-quality speech data in Hindi was released. Everyone who participated in this challenge was then free to use this data for research purposes.

1. Train set – 100 hours (labeled)

2. Development set – 5 hours (labeled)

3. Unlabeled set – 1000 hours

Permission for commercial use of the dataset must be sought through a data sharing agreement signed with Gram Vaani.