Curated Datasets Useful for Building Commons Management Tools
Administrative boundaries determine jurisdiction for policies and legislation, and with it control over resources, activities, and development. We have compiled a dataset of Indian administrative boundaries at the state, district, block, gram panchayat, and village levels, together with agro-climatic zones.
Some administrative boundaries change between census years as states, districts, villages, and other units are created or merged over time. We have mapped these changes between the years 2001 and 2011.
India State Boundary: https://drive.google.com/drive/folders/1vhxQq3c1E4VyBCpCN0fFFxDXZUzIVsN_?usp=share_link
India District Boundary: https://drive.google.com/drive/folders/1QeZ6v2Uv6sDNv2Uzeyzc7T06Y4DrfMjn?usp=share_link
India Gram Panchayat Boundary: https://drive.google.com/drive/folders/1BMyDXwTmQf3lamwcGG78YnB8DwFbySr8?usp=sharing
India Village Boundary: https://drive.google.com/drive/folders/1yOe055tlSCQHEf2dCI4nL7QLTo58urmv?usp=sharing
India Agro-climatic Zones: https://drive.google.com/drive/folders/1SVqlbjS3QAof_kxZkQzgI-ljFEvdmaCf?usp=share_link
India Block Boundaries: https://drive.google.com/drive/folders/15J_jPhIO0zG_kzF7JR2Mta1tzE_mPSnu?usp=share_link
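To illustrate how boundary changes between 2001 and 2011 can be represented, here is a minimal sketch of a crosswalk held as pairs of old and new unit codes. The codes, the pair layout, and the `classify_changes` helper are hypothetical and are not taken from the files linked above; the sketch only shows how splits and merges can be told apart once such a mapping exists.

```python
from collections import defaultdict

# Hypothetical (2001 code, 2011 code) pairs. Real codes would come from the
# boundary files; these are placeholders for illustration only.
CROSSWALK = [
    ("D01", "D01"),  # one-to-one
    ("D02", "D02"),  # D02 was split into D02 and D03
    ("D02", "D03"),
    ("D04", "D05"),  # D04 and D06 were merged into D05
    ("D06", "D05"),
]


def classify_changes(pairs):
    """Label each 2001 unit as 'split', 'merged', or 'unchanged' (one-to-one)."""
    successors = defaultdict(set)    # 2001 code -> set of 2011 codes
    predecessors = defaultdict(set)  # 2011 code -> set of 2001 codes
    for old, new in pairs:
        successors[old].add(new)
        predecessors[new].add(old)

    changes = {}
    for old, news in successors.items():
        if len(news) > 1:
            changes[old] = "split"
        elif len(predecessors[next(iter(news))]) > 1:
            changes[old] = "merged"
        else:
            changes[old] = "unchanged"
    return changes


changes = classify_changes(CROSSWALK)
```

A unit that is both split and merged is reported as "split" here; a real crosswalk may need a finer set of labels.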
Land Use and Land Cover
Land use and land cover classification is the categorization of human activities and natural elements on the landscape within a specific time frame based on scientific and statistical methods of analysis of remote sensing data. Our goal is to provide reproducible classification outputs using open-source satellite data and publicly available code that can be executed on commodity platforms like Google Earth Engine.
NREGA geotagged assets data: ISRO has undertaken the task of geotagging NREGA works, and the data has been made publicly available through the Bhuvan portal. It provides basic details of each work, such as work code and asset ID, but not other essential metadata such as work type, expenditure, and the number of person-days needed to complete the work. This metadata has been obtained from the NREGA MIS website, which is managed by the Ministry of Rural Development, Government of India.
NREGA panchayat dataset: Panchayat-level NREGA metadata, such as the number of job cards issued, the number of active job cards, and the number of person-days of employment provided, has been crawled from the same NREGA MIS website for each financial year from 2014-15 to 2020-21.
Documentation for NREGA geotagged assets: https://drive.google.com/file/d/1AfYesx2eAFbmkyrbHMnEBJBBabISXSjZ/view?usp=share_link
Documentation for panchayat data: https://drive.google.com/file/d/1BWQffkyk0vmzd4vHcC5mX5zfhFBTc8cN/view?usp=share_link
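The geotagged assets and the MIS metadata describe the same works, so one way to combine them is a join on the work code. The sketch below assumes that the work code is the shared key and uses hypothetical field names (`work_code`, `asset_id`, `work_type`, `expenditure`, `persondays`) and made-up records; the actual field names are described in the documentation linked above.

```python
def attach_mis_metadata(assets, mis_records):
    """Left-join geotagged assets to MIS metadata on the (assumed) work code key."""
    by_code = {rec["work_code"]: rec for rec in mis_records}
    merged = []
    for asset in assets:
        meta = by_code.get(asset["work_code"], {})
        row = dict(asset)
        # Assets without a matching MIS record keep None for the metadata fields.
        row["work_type"] = meta.get("work_type")
        row["expenditure"] = meta.get("expenditure")
        row["persondays"] = meta.get("persondays")
        merged.append(row)
    return merged


# Made-up example records, for illustration only.
assets = [
    {"work_code": "W-001", "asset_id": "A-10"},
    {"work_code": "W-002", "asset_id": "A-11"},
]
mis_records = [
    {"work_code": "W-001", "work_type": "Farm pond",
     "expenditure": 125000.0, "persondays": 410},
]

merged = attach_mis_metadata(assets, mis_records)
```

A left join keeps every geotagged asset, which makes it easy to see which works have no metadata yet.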
Water balance is an approach in hydrology for estimating unknown water fluxes. It is expressed as an equation in terms of the water inputs, outputs, and storage in a watershed. We intend to estimate the net change in groundwater for a micro-watershed on a fortnightly basis by solving the water balance equation, which takes precipitation, runoff, evapotranspiration, and change in soil moisture as inputs and gives the change in groundwater as output. Each of these inputs is derived from remote sensing products in order to diagnose the groundwater state of a micro-watershed. Following the Central Ground Water Board (CGWB), Government of India, the groundwater state can be Safe, Semi-critical, Critical, or Over Exploited. With the current groundwater state in hand, the objective is to improve it through the following interventions:
- Change in cropping patterns
- Construction of rainwater harvesting structures
- Building plantations
The interventions will affect different components of the water balance equation and will allow us to project the groundwater state in the future. For example, a change in cropping patterns will affect evapotranspiration in the water balance equation. Groundwater projections will facilitate scientific and participatory planning within the community.
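The water balance described above can be sketched as a residual calculation: whatever precipitation is not accounted for by runoff, evapotranspiration, or a change in soil moisture is attributed to groundwater. The sign convention, the units (mm per fortnight), the class and function names, and all the numbers below are assumptions made for illustration; they are not the project's implementation.

```python
from dataclasses import dataclass, replace


@dataclass
class FortnightFluxes:
    """Water balance inputs for one micro-watershed over one fortnight (mm)."""
    precipitation_mm: float
    runoff_mm: float
    evapotranspiration_mm: float
    delta_soil_moisture_mm: float


def groundwater_change_mm(f):
    """Residual form assumed here: dG = P - R - ET - dSM."""
    return (f.precipitation_mm
            - f.runoff_mm
            - f.evapotranspiration_mm
            - f.delta_soil_moisture_mm)


# Made-up baseline values, for illustration only.
baseline = FortnightFluxes(
    precipitation_mm=120.0,
    runoff_mm=30.0,
    evapotranspiration_mm=60.0,
    delta_soil_moisture_mm=10.0,
)
delta_g_baseline = groundwater_change_mm(baseline)

# Hypothetical intervention: a cropping-pattern change that lowers
# evapotranspiration by 10 mm, with everything else held fixed.
with_intervention = replace(baseline, evapotranspiration_mm=50.0)
delta_g_intervention = groundwater_change_mm(with_intervention)
```

With these numbers the baseline gives a groundwater change of 20 mm and the intervention case gives 30 mm, which shows how a change in one component carries through to the projection.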
Hydrologic soil groups: Global Hydrologic Soil Groups (HYSOGs250m) for Curve Number-Based Runoff Modeling (ornl.gov)
Land use land cover classification: Bansal, C., Ahlawat, H. O., Jain, M., Prakash, O., Mehta, S. A., Singh, D., … & Seth, A. (2021, June). IndiaSat: A Pixel-Level Dataset for Land-Cover Classification on Three Satellite Systems-Landsat-7, Landsat-8, and Sentinel-2. In ACM SIGCAS Conference on Computing and Sustainable Societies (pp. 147-155).
The above-mentioned datasets can be provided at the block level upon request.
The Human Development Index (HDI) builds an aggregate index for development by giving equal weightage to indicators for economic development (per capita GDP), education (literacy rate), and health (life expectancy). We similarly build an aggregate development index (ADI) based on different socio-economic indicators that we predict using remote-sensing data. This index can be computed at a higher spatio-temporal resolution and can depict the development status of a district or village, which allows targeted policy-making initiatives.
In the CSV file linked below, the columns ADI_2001, ADI_2011, and ADI_2019 hold the ADI value at the village level. A blank entry in one of these columns means that the value could not be computed because the shapefile or remote sensing data was missing.
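As a rough illustration of an equal-weight aggregate index, the sketch below rescales each indicator to the range 0 to 1 and then averages them per unit. The min-max rescaling, the arithmetic mean, the indicator names, and the values are all assumptions chosen for illustration; the text does not specify how the ADI itself is normalised or aggregated.

```python
def min_max(values):
    """Rescale a list of numbers to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_weight_index(indicators):
    """Average the rescaled indicators with equal weights, one score per unit.

    `indicators` maps an indicator name to a list of values, one per unit,
    with all lists in the same unit order.
    """
    rescaled = [min_max(vals) for vals in indicators.values()]
    n_units = len(rescaled[0])
    return [sum(col[i] for col in rescaled) / len(rescaled)
            for i in range(n_units)]


# Hypothetical indicators for three villages, for illustration only.
indicators = {
    "asset_ownership": [20.0, 50.0, 80.0],
    "literacy": [40.0, 70.0, 100.0],
    "electrification": [0.0, 25.0, 100.0],
}
scores = equal_weight_index(indicators)
```

Because every indicator is rescaled before averaging, no single indicator dominates the score just because it is measured on a larger scale.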
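When reading the ADI columns, blank cells need to be kept as missing values instead of being treated as zero. The sketch below shows one way to do that with the standard library; the `village_id` column and the sample rows are hypothetical, and only the three ADI column names come from the description above.

```python
import csv
import io

ADI_COLUMNS = ("ADI_2001", "ADI_2011", "ADI_2019")


def read_adi(handle):
    """Parse ADI rows, mapping blank cells to None so they are not read as 0."""
    rows = []
    for rec in csv.DictReader(handle):
        row = {"village_id": rec["village_id"]}  # hypothetical identifier column
        for col in ADI_COLUMNS:
            text = (rec.get(col) or "").strip()
            row[col] = float(text) if text else None
        rows.append(row)
    return rows


# Made-up sample standing in for the real file.
sample = io.StringIO(
    "village_id,ADI_2001,ADI_2011,ADI_2019\n"
    "V1,0.31,0.42,0.55\n"
    "V2,,0.38,0.47\n"
)
rows = read_adi(sample)

# Villages with a value in every year, e.g. for computing change over time.
complete = [r for r in rows if all(r[c] is not None for c in ADI_COLUMNS)]
```

To read a real file, pass an open file object (opened with `newline=""`) in place of the `io.StringIO` sample.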
These databases are community-generated and not available in the open. Please contact us for access.
The dataset comprises question-answer pairs, in which questions sharing the same answer are grouped to represent various ways of asking the same question. It is categorized into broad themes, enabling more efficient answer retrieval when theme information is available.
- Caller Query: This is the link to the audio file for the question that is asked by the user.
- Caller Query Transcription: This is the text version of the original caller query after the transcription.
- Relevant Question: Many callers include personal information and other irrelevant parts in their question. All additional information is removed from these questions to create the relevant question. This is the column on which the model is trained and tested.
- Sanitized Question: A refined version of the question is manually created so that different relevant questions can be mapped to the corresponding sanitized questions.
- Theme and Sub-Theme: All caller query transcriptions are tagged with a corresponding theme and sub-theme by the caller.
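The structure described above (several relevant questions mapped to one sanitized question, plus a theme that can narrow the search) can be sketched as follows. The records, the dictionary keys, and the use of `difflib` string similarity are assumptions made for illustration; in particular, the string matcher is a simple stand-in and is not the model that was trained on the Relevant Question column.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Made-up records, for illustration only.
records = [
    {"relevant_question": "how do I apply for a job card",
     "sanitized_question": "How to apply for a job card?",
     "theme": "employment"},
    {"relevant_question": "where can I get a job card made",
     "sanitized_question": "How to apply for a job card?",
     "theme": "employment"},
    {"relevant_question": "how do I apply for a ration card",
     "sanitized_question": "How to apply for a ration card?",
     "theme": "food security"},
]


def group_by_sanitized(rows):
    """Collect the different phrasings that map to each sanitized question."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["sanitized_question"]].append(row["relevant_question"])
    return dict(groups)


def retrieve(query, rows, theme=None):
    """Return the sanitized question of the closest phrasing, or None.

    If a theme is given, only records with that theme are searched.
    """
    candidates = [r for r in rows if theme is None or r["theme"] == theme]
    if not candidates:
        return None

    def similarity(row):
        return SequenceMatcher(
            None, query.lower(), row["relevant_question"].lower()).ratio()

    return max(candidates, key=similarity)["sanitized_question"]


groups = group_by_sanitized(records)
match = retrieve("how to apply for a card", records, theme="food security")
```

Filtering by theme first shrinks the candidate set, which is the efficiency gain the theme information is meant to provide.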
Purpose of the data collection:
A challenge on Automatic Speech Recognition (ASR) for Hindi was organized as part of INTERSPEECH 2022 by sharing the spontaneous telephone speech recordings collected by Gram Vaani. The regional variations of Hindi, together with the spontaneity of speech, natural background, and transcriptions with varying degrees of accuracy due to crowdsourcing, make it a unique corpus for automatic recognition of spontaneous telephone speech.
Recent advancements in speech technology have shown that ASR systems can work on par with humans. Building a good ASR system requires large amounts of training data and high-end computational resources. However, when it comes to Indian languages, not everyone has access to these resources, especially academic institutions and startups. As a part of this challenge, telephone-quality speech data in Hindi was released. Everyone who participated in the challenge was then free to use this data for research purposes.
1. Train set – 100 hours (labeled)
2. Development set – 5 hours (labeled)
3. 1000 hours of unlabelled data
Permission for commercial use of the dataset must be sought through a data sharing agreement signed with Gram Vaani.