1 How ERA Connects to Geospatial Climate and Soils Data
Intended Users
This documentation is intended for technical users working with the ERA meta-dataset who wish to integrate seasonally relevant climate statistics into agronomic observations. Users do not need to rerun the calculations—preprocessed climate data are provided on S3—but may use this guide to understand:
What climate indicators were generated
How planting dates and season lengths were determined
Where to find the data and how to merge them with ERA observations
Where to find the code used to generate and process data
Background
We developed a geospatial enrichment pipeline to augment ERA’s agronomic experiments with high-resolution climate, soil, and elevation data, linked to specific crops, locations, and growing seasons. Each observation is connected to daily weather time series and soil attributes based on its site coordinates and reported or inferred planting and harvest dates. Where precise dates are unavailable, the pipeline uses a tiered imputation approach—drawing on published planting windows, nearby analogs, and agroclimatic indicators such as rainfall onset—to estimate a plausible growing season. This enables the calculation of detailed climate statistics for the period most relevant to crop development, while excluding records with excessive spatial or temporal uncertainty.
The enrichment process applies only to crop-based experiments. Climate statistics are generated only where both spatial and temporal resolution meet defined quality thresholds—specifically, where the site location is known within 50 km and the cropping calendar can be clearly determined. Records from animal feed experiments, as well as spatially or temporally aggregated data (e.g., regional summaries or multi-year averages), are not included. As a result, only a subset of ERA observations receive climate enrichment—those with sufficient detail to anchor the analysis in a specific place and season.
1.1 Data Sources
The ERA pipeline enriches observations with climate, soil, and landscape data using custom functions stored in:
Soil data are used to estimate key properties like water-holding capacity, which underpin the calculation of climate indicators such as Eratio and waterlogging. Two soil datasets are used depending on site location:
Use: Used for all African sites in ERA. Offers high-resolution predictions of soil texture, carbon, pH, and depth—well-suited to the diversity of African agroecosystems.
The generate_climate_stats.R pipeline constructs crop-specific seasonal windows and computes derived climate indicators for each observation in the ERA agronomy dataset. These indicators are designed to reflect climate conditions experienced during the growing season, rather than general climatological conditions.
Each observation is linked to a custom seasonal window based on:
Reported Planting and Harvest Dates: If available, these dates define the crop’s growing period directly.
Imputed Dates: Where planting or harvest dates are missing or uncertain, the pipeline estimates plausible values using:
Nearby observations (within 1–10 km)
Published planting calendars
Agroclimatic thresholds (e.g. start of rainy season based on dekadal CHIRPS rainfall)
Season Length Estimation: Season length is either taken from the original dataset, imputed from nearby records, or inferred from EcoCrop definitions of crop cycle duration.
Alternate Windows: In addition to the main growing period, alternate windows are used for specific purposes:
PDate.SLen.EcoCrop: uses EcoCrop-inferred season length
PDate.SLen.P30: fixed 30-day window after planting (used to assess early-season climate stress)
These windows allow climate statistics to be calculated only for periods relevant to crop development, improving interpretation compared to annual or calendar-based averages.
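As a rough illustration of how the three windows differ, the sketch below builds start and end dates from a planting date and a season length. The column names follow the site_data fields described later in this guide, but the values are invented, `window_end()` is a hypothetical helper, and treating the end date as planting date plus season length is an assumption rather than the pipeline's confirmed convention.

```r
# Invented example rows; column names follow the site_data fields in this guide
obs <- data.frame(
  P.Date.Merge      = as.Date(c("2015-03-15", "2016-10-01")),
  SeasonLength.Data = c(120, NA),  # season length from ERA data (days)
  SLen.EcoCrop      = c(110, 95)   # season length from EcoCrop (days)
)

# Hypothetical helper: end of window = planting date + length in days
window_end <- function(start, len_days) start + len_days

windows <- data.frame(
  start       = obs$P.Date.Merge,
  end_data    = window_end(obs$P.Date.Merge, obs$SeasonLength.Data),
  end_ecocrop = window_end(obs$P.Date.Merge, obs$SLen.EcoCrop),
  end_p30     = window_end(obs$P.Date.Merge, 30)
)
print(windows)
```

The second row has no data-derived season length, so its `end_data` is NA while the EcoCrop and 30-day windows are still defined. This mirrors why the EcoCrop window covers more records than the data-derived one.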
Climate Statistics Generated
Unlike the foundational datasets (e.g., daily rainfall, temperature, radiation), which provide raw gridded values, this pipeline produces seasonally aggregated statistics aligned with cropping windows. These include:
Temperature: Mean, max, min, and variability of daily temperatures; heat stress indicators (e.g., number of days >35°C).
Rainfall: Total rainfall, dry spell frequency, rainfall adequacy.
Growing Degree Days (GDD): Thermal accumulation across sub-optimal, optimal, and heat-stressed temperature bands.
Evaporative Ratio (ERatio): Daily ratio of actual to potential evapotranspiration — a proxy for drought stress.
Dry Spells: Frequency, length, and timing of low-rainfall periods.
Each of these indicators is calculated per site–season–crop combination using the daily CHIRPS and POWER datasets and simulated water balance (see water_balance.R).
1.5 These derived indicators provide a biophysically relevant summary of climate exposure tailored to the actual growing period of each crop, making them more actionable than raw daily data or long-term averages.
1.5.1 Downloading the climate data
To access the climate statistics generated for ERA observations, download the harmonized .RData file from the geodata directory on S3:
Content: This file contains daily and seasonal climate summaries per site, ready to be joined with ERA observations.
You can download the file using the s3fs interface as follows:
# Set the remote S3 path and local save path
s3_data_dir <- "s3://digital-atlas/era/geodata"
local_data_dir <- "downloaded_data"

# List and filter files
s3 <- s3fs::S3FileSystem$new(anonymous = TRUE)
files_s3 <- s3$dir_ls(s3_data_dir)
files_s3 <- grep("clim_stats.*RData", files_s3, value = TRUE)

# Filter to most recent version of dataset
(files_s3 <- tail(files_s3, 1))
site_data: contains the spatial and temporal location data for which climate statistics are generated.
PDate.SLen.Data, PDate.SLen.EcoCrop, PDate.SLen.P30: these objects are lists of output climate data calculated for different parameterizations of season length.
1.6.1 Unique locations and times (clim_data$site_data)
site_data contains the unique combinations of site, time, crop, planting date, and harvest date from the ERA agronomy dataset.
Site.Key: A unique identifier for each site or location. It is used to link locations consistently across datasets.
Code: A unique code used to identify a publication or entry in the ERA dataset. It serves as the main key for tracking a specific experiment/publication across associated tables.
M.Year: Measurement year – a code that identifies the production season, typically aligned with the Time field in the main ERA dataset. This may take the form of a calendar year or include other formatting to distinguish multiple seasons per year.
Latitude: Geographic latitude of the site in decimal degrees (WGS84). Used for spatial analyses and mapping.
Longitude: Geographic longitude of the site in decimal degrees (WGS84). Used for spatial analyses and mapping.
M.Year.Code: A standardized or formatted version of M.Year, often combining year and season. Useful for indexing and subsetting.
M.Season: Management season (typically 1 or 2) indicating the cropping season within a year. May be NA in unimodal systems; helps distinguish multiple cropping events in bimodal climates.
1.6.1.2 Crops
These fields identify the crop and hold the EcoCrop temperature thresholds that define its temperature response curve.
Product: The name of the crop or agricultural product (e.g., maize, beans) associated with the management and outcome data.
EU: Experimental Unit code; links to the era_master_codes$EU table.
Tlow: The minimum temperature threshold for crop development. Below this value, crop growth is assumed to be negligible or halted. Often derived from EcoCrop or agronomic sources.
Thigh: The maximum temperature threshold for crop development. Temperatures above this can lead to heat stress or failure in development.
Topt.low: The lower bound of the optimal temperature range for crop growth. Within this and Topt.high, the crop achieves near-optimal physiological performance.
Topt.high: The upper bound of the optimal temperature range for crop growth. Growth efficiency typically declines beyond this value, even if not fully stressed.
These thresholds define a crop’s temperature response curve and come from EcoCrop. They can also be used to calculate growing degree days, stress indices, or suitability zones under historical or future climate conditions.
1.6.1.3 Planting dates
site_data contains information about planting dates and their estimation:
Plant.Start: The reported start date for planting. This indicates when the planting period began according to the original data.
Plant.End: The reported end date for planting. This marks the conclusion of the planting period in the original dataset.
Plant.Diff.Raw: The difference (in days) between Plant.Start and Plant.End—indicating how uncertain the reported planting window was.
Data.PS.Date: The estimated start date for planting, inferred from nearby or similar observations in ERA when a reported planting date is missing or uncertain.
Data.PE.Date: The estimated end date for planting, derived using the same method as Data.PS.Date to define a plausible planting window.
SOS: The estimated Start of Season date, derived from daily climate data using agroclimatic thresholds (e.g. rainfall ≥25 mm in a dekad and ≥20 mm in the following two dekads, with aridity index AI ≥ 0.5). It marks when planting conditions were first met based on climatic signals.
P.Date.Merge: The final, merged planting date calculated by the pipeline. It represents a consolidated planting date that may incorporate adjustments or estimations (for example, averaging the planting window or refining it using rainfall data). It should be interpreted as the number of days since 1900-01-01.
P.Date.Merge.Source: A descriptive label indicating the source or method used to derive the merged planting date. This might indicate whether the date was taken directly from published data (e.g., “Published”) or estimated using spatial or rainfall data (e.g., “Nearby 1km”, “SOS + Published”, etc.).
Values below are presented in order of preference when estimating planting date in the P.Date.Merge field:
Published: The planting date was directly reported in the original study with no need for estimation.
Published CHIRPS: A published planting date was available but was refined or verified using CHIRPS rainfall data.
Nearby 1km CHIRPS: The estimation was based on observations from locations within a 1‑km radius, with additional refinement using CHIRPS data.
Nearby 10km CHIRPS: As with the 10‑km estimation, this method further incorporated CHIRPS rainfall data to improve the estimate.
Nearby 1km: Similar to the CHIRPS-based 1‑km estimate but without the additional rainfall data refinement.
Nearby 10km: The planting date was estimated from nearby observations aggregated over a 10‑km radius due to missing or uncertain reported dates.
SOS + Published: The planting date was adjusted using SOS information in cases where the published date was uncertain, without incorporating CHIRPS data.
SOS + Published CHIRPS: When the reported planting date (Published) was too uncertain, the method adjusted it using the Start‐of‐Season (SOS) rainfall onset data alongside CHIRPS information.
This hierarchy reflects a logical preference: Directly observed data > Nearby analogues > Climatological estimation.
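The sketch below shows one way to work with these two fields: converting P.Date.Merge to a date using the 1900-01-01 origin stated above, and ranking each record by the preference order of its source. The example rows are invented, and the label spellings in `source_levels` are copied from the list above as an assumption; check them against the values actually present in site_data before relying on the ranking.

```r
# Assumed label spellings, taken from the preference list in this guide
source_levels <- c(
  "Published", "Published CHIRPS",
  "Nearby 1km CHIRPS", "Nearby 10km CHIRPS",
  "Nearby 1km", "Nearby 10km",
  "SOS + Published", "SOS + Published CHIRPS"
)

# Invented example rows
example <- data.frame(
  P.Date.Merge        = c(42000, 42150, NA),
  P.Date.Merge.Source = c("Nearby 10km", "Published", NA)
)

# Days since 1900-01-01 -> Date (only needed if the column is stored as a number)
example$planting_date <- as.Date(example$P.Date.Merge, origin = "1900-01-01")

# Rank 1 = most preferred source; labels not found in source_levels become NA
example$source_rank <- match(example$P.Date.Merge.Source, source_levels)
print(example)
```

A rank like this can be used to filter an analysis to directly reported planting dates, or to test whether results are sensitive to including estimated dates.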
1.6.1.4 Season length
site_data contains information about reported harvest dates and season length. Season length may use the reported dates or be estimated.
Harvest.Start: The reported or estimated date when harvest began. Typically reflects the first day of the harvest window.
Harvest.End: The reported or estimated date when harvest concluded. Typically reflects the last day of the harvest window.
SLen: Season Length – calculated as the number of days between Plant.Start and Harvest.End. Represents the observed or estimated duration of the cropping cycle.
Data.SLen: Season Length derived from reported data only (i.e., Plant.Start and Harvest.End must both be available from original records). Used to indicate where the season length is based on direct evidence rather than estimates.
SLen.EcoCrop: An estimate of cropping cycle length derived from the EcoCrop database refined using data available in ERA where possible. Used as a fallback when data-derived values are missing. SeasonLength.EcoCrop is redundant and contains the same information as SLen.EcoCrop.
SLen.Source: This field indicates how the final Season Length (SeasonLength.Data field) used in calculations was derived, based on the origin of planting and harvest date estimates. The format is: <Planting Source> + <Season Length Source>.
SeasonLength.Data: Combines the SLen and Data.SLen fields, substituting Data.SLen values when SLen is NA.
The format of SLen.Source is <Planting Source> + <Season Length Source> and the order of preference for the season length source is the same as for planting. Observed values include:
- Published + Pub – Both planting and harvest dates are reported with low uncertainty in the publication.
- Published + Nearby 1km – Planting date reported with low uncertainty; season length estimated from nearby (within 1 km) observations.
- CHIRPS Published + Pub – Planting date reported, but uncertain, and refined using CHIRPS rainfall; harvest dates reported with low uncertainty.
- Nearby 1km + Nearby 1km – Both planting date and season length derived from nearby (within 1 km) observations.
- Nearby 1km + Nearby 10km – Planting date from 1 km radius; season length from 10 km radius.
- SOS + Published + Nearby 1km – The planting date was adjusted using SOS information in cases where the published date was uncertain, without incorporating CHIRPS data; season length from nearby data.
- CHIRPS SOS + Published + Pub – When the reported planting date (Published) was too uncertain, the method adjusted it using the Start‐of‐Season (SOS) rainfall onset data alongside CHIRPS information; harvest dates reported with low uncertainty.
- Published + Nearby 10km – Planting date reported with low uncertainty; season length from 10 km proximity.
- Nearby 1km + Pub – Planting date from nearby (within 1 km) observations; harvest dates reported with low uncertainty.
- Nearby 10km + Nearby 1km – Planting date from nearby (within 10 km) observations; season length from nearby (within 1 km) observations.
- NA – No season length estimate was available or derived.
These combinations trace the logical fallback and merging sequence for generating season length when direct data are missing or uncertain.
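A minimal sketch of the two rules described in this section: filling SeasonLength.Data from SLen and Data.SLen, and splitting SLen.Source into its planting and season-length parts. The rows are invented, and splitting on the last " + " is an assumption based on the label format shown above (it keeps compound planting labels such as "SOS + Published" intact).

```r
# Invented example rows using the field names from this section
site_rows <- data.frame(
  SLen        = c(120, NA, NA),
  Data.SLen   = c(120, 95, NA),
  SLen.Source = c("Published + Pub", "SOS + Published + Nearby 1km", NA),
  stringsAsFactors = FALSE
)

# Use SLen where available, otherwise fall back to Data.SLen
site_rows$SeasonLength.Data <- ifelse(
  is.na(site_rows$SLen), site_rows$Data.SLen, site_rows$SLen
)

# Split "<Planting Source> + <Season Length Source>" at the last " + "
site_rows$planting_source <- sub(" \\+ [^+]*$", "", site_rows$SLen.Source)
site_rows$season_source   <- sub("^.* \\+ ", "", site_rows$SLen.Source)
print(site_rows)
```

Rows where both season length fields are NA stay NA, which corresponds to records that receive no statistics in the data-derived climate window.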
These can be merged with ERA observation data using the Site.ID and Time fields.
1.6.2 Climate data (PDate.SLen.Data, PDate.SLen.EcoCrop, PDate.SLen.P30)
Each of these climate window datasets contains a set of summary tables—one per climate indicator (e.g., temperature, rainfall, GDD)—with statistics calculated over the defined seasonal window for every crop-site-season combination that passed quality filters.
PDate.SLen.Data: site_data$P.Date.Merge and site_data$SeasonLength.Data are used to determine the start and end dates within which climate statistics are calculated. If season length is not reported or cannot be inferred from ERA data for a row in site_data, then no climate statistics are generated for that record.
PDate.SLen.EcoCrop: site_data$P.Date.Merge and site_data$SLen.EcoCrop are used to determine the start and end dates within which climate statistics are calculated. Season length is inferred from the midpoint of the EcoCrop cycle length for a crop, refined where possible using reported values within the ERA dataset. This dataset therefore imputes missing season lengths and contains more records than PDate.SLen.Data; however, season length is likely to be less accurate.
PDate.SLen.P30: site_data$P.Date.Merge is used to determine the start date of the climate window, and the end date is fixed to 30 days after planting. This represents the post-planting climate, which can be a particularly sensitive period for many crops.
Each of the following names corresponds to a list of climate statistics calculated over the seasonal window defined by P.Date.Merge and SeasonLength.Data:
gdd: Growing Degree Days — cumulative heat units over the season binned into thermal stress classes, useful for crop development and heat stress exposure tracking.
temperature: Mean, minimum, and maximum temperatures over the season. Consecutive and total days above/below temperature thresholds.
rainfall: Total and average precipitation during the season. Consecutive and total days above/below precipitation thresholds.
eratio: Ratio of actual to potential evapotranspiration from a simulated daily water balance; a proxy for water availability or drought stress.
logging: Waterlogging statistics based on soil water held above field capacity in the same daily water balance simulation; an indicator of excess moisture conditions.
Each object is a data.table with one row per Site.ID and columns containing summary statistics for that climate indicator.
1.6.2.1 Shared fields (index or key fields)
These fields are needed for merging the climate statistics back to the ERA comparisons table.
All tables share these fields:
- Site.Key: The site identifier for spatially reconnecting to the ERA comparisons table.
- M.Year: The time period identifier for temporally reconnecting to the ERA comparisons table.
- EU: The crop or animal product code.
- Product: The crop or animal product name (this corresponds to the Product.Simple field in the ERA comparisons table).
- Plant.Start: The original planting start date (as per the ERA comparisons table raw data).
- Plant.End: The original planting end date (as per the ERA comparisons table raw data).
- Harvest.Start: The original harvest start date (as per the ERA comparisons table raw data).
- Harvest.End: The original harvest end date (as per the ERA comparisons table raw data).
Additionally, these shared fields are present:
- window: Description of window used, useful if merging tables that use different climate window calculation methods.
- row_index : Internal index to link this row back to the corresponding entry in the site_data table.
This table contains Growing Degree Day (GDD) statistics calculated over the defined season window for each site. Here’s what each field represents:
- gdd_subopt: Cumulative GDD within the sub-optimal temperature range for crop growth (above base temperature but below optimal).
- gdd_opt: Cumulative GDD within the optimal temperature range — where the crop is expected to grow most efficiently.
- gdd_aboveopt: Cumulative GDD in the above-optimal range, where temperatures may begin to reduce growth efficiency.
- gdd_abovemax: Cumulative GDD above the maximum threshold, indicating heat stress or potentially damaging conditions.
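To make the four bands concrete, the sketch below splits each day's degrees above Tlow across the bands defined by the EcoCrop thresholds and sums them over a window. This section does not state the exact formula used by the pipeline, so the banding rule, the function name `gdd_bands()`, and the example temperatures and thresholds are all assumptions for illustration.

```r
# Hypothetical banding rule: each day's degrees above Tlow are partitioned
# into the four bands, so the bands sum to total degree days above Tlow.
gdd_bands <- function(tmean, Tlow, Topt.low, Topt.high, Thigh) {
  # Degrees of each daily mean temperature falling between lower and upper
  band <- function(lower, upper) pmax(0, pmin(tmean, upper) - lower)
  c(
    gdd_subopt   = sum(band(Tlow, Topt.low)),
    gdd_opt      = sum(band(Topt.low, Topt.high)),
    gdd_aboveopt = sum(band(Topt.high, Thigh)),
    gdd_abovemax = sum(pmax(0, tmean - Thigh))
  )
}

# Invented daily mean temperatures (deg C) and thresholds
tmean <- c(8, 15, 22, 30, 36)
gdd <- gdd_bands(tmean, Tlow = 10, Topt.low = 18, Topt.high = 28, Thigh = 34)
print(gdd)
```

Under this rule a single hot day contributes to several bands at once; other conventions assign the whole day to one band, so compare against the pipeline code before reproducing values.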
This table summarizes temperature-related climate statistics. Fields fall into two main categories:
1. Heat Stress Threshold Indicators (tmax_tg_*)
These fields summarize extreme high-temperature events, using thresholds of 35°C, 37.5°C, and 40°C. The same set of metrics is calculated for each threshold:
tmax_tg_[threshold].days: Total number of days where maximum temperature (Tmax) exceeded the threshold. e.g., tmax_tg_35.days = number of days > 35°C.
tmax_tg_[threshold].days_pr: Proportion of days in the season above the threshold.
tmax_tg_[threshold].max_rseq: Maximum length of any consecutive sequence of days above the threshold.
tmax_tg_[threshold].n_seq_dX: Number of sequences of at least X days where Tmax stayed above the threshold.
d5: ≥5 consecutive days.
d10: ≥10 consecutive days
d15: ≥15 consecutive days
These indicators help assess the intensity, persistence, and frequency of heat stress.
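The day counts, proportions, and spell statistics described above can be derived from a daily series with run-length encoding. The following is a minimal sketch using base R's `rle()`; the function name `threshold_stats()`, its output names, and the example temperatures are illustrative and not taken from the ERA code.

```r
# Summarize exceedance of a threshold in a daily series (illustrative helper)
threshold_stats <- function(x, threshold, min_lengths = c(5, 10, 15)) {
  exceed <- !is.na(x) & x > threshold          # TRUE on days above threshold
  runs <- rle(exceed)                          # consecutive TRUE/FALSE runs
  run_lengths <- runs$lengths[runs$values]     # lengths of exceedance runs only
  n_seq <- vapply(min_lengths, function(d) sum(run_lengths >= d), integer(1))
  names(n_seq) <- paste0("n_seq_d", min_lengths)
  c(
    days    = sum(exceed),
    days_pr = mean(exceed),
    max_seq = if (length(run_lengths) > 0) max(run_lengths) else 0,
    n_seq
  )
}

# Invented daily maximum temperatures (deg C)
tmax <- c(36, 36, 36, 36, 36, 36, 30, 38, 38, 30)
heat <- threshold_stats(tmax, threshold = 35)
print(heat)
```

Here eight of ten days exceed 35°C, the longest run is six days, and only that run counts as a spell of at least five days. The same pattern applies to any "below threshold" indicator by reversing the comparison.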
2. General Temperature Statistics
These capture broader temperature behavior during the season:
Tmin-related fields:
tmin_min: Minimum of daily minimum temperatures
tmin_mean: Mean daily minimum temperature
tmin_var: Variance of daily minimum temperatures
tmin_sd: Standard deviation of daily minimum temperatures
tmin_range: Difference between max and min daily minimum temperatures
Tmax-related fields:
tmax_max: Maximum of daily maximum temperatures
tmax_mean: Mean daily maximum temperature
tmax_var: Variance of daily maximum temperatures
tmax_sd: Standard deviation of daily maximum temperatures
tmax_range: Difference between max and min daily maximum temperatures
Tmean (daily average temperature) fields:
tmean_max: Maximum of daily mean temperatures
tmean_min: Minimum of daily mean temperatures
tmean_mean: Mean of daily mean temperatures
tmean_var: Variance of daily mean temperatures
tmean_sd: Standard deviation of daily mean temperatures
tmean_range: Difference between max and min daily mean temperatures
These metrics provide a comprehensive description of temperature variability and extremes during the growing season.
This table summarizes rainfall-related climate statistics.
1. Total and Derived Rainfall Metrics
- rain_sum: Total rainfall (mm) accumulated over the observation window.
- eto_sum: Total reference evapotranspiration (mm) over the window, calculated from NASA POWER data.
- eto_na: Number of days with missing ETO values due to data unavailability.
- w_balance: Approximate seasonal water balance: rain_sum – eto_sum.
- w_balance_negdays: Number of days when daily rainfall < daily evapotranspiration (i.e., water deficit days).
2. Dry Spell Indicators (rain_l_*)
These indicators summarize dry spells using thresholds of 0.1 mm, 1 mm, and 5 mm of daily rainfall.
For each threshold:
- rain_l_[threshold].days: Total number of days below the rainfall threshold. e.g., rain_l_1.days = number of days with rainfall < 1 mm.
- rain_l_[threshold].days_pr: Proportion of total days below the threshold.
- rain_l_[threshold].max_seq: Length of the longest consecutive sequence of dry days.
- rain_l_[threshold].n_seq_dX: Number of dry spells lasting at least X days:
- d5 = ≥5 consecutive days
- d10 = ≥10 consecutive days
- d15 = ≥15 consecutive days
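As a worked example of the rainfall fields, the sketch below computes the totals, the approximate water balance, and the 1 mm dry-day indicators for one invented week of daily data. The field names follow the list above; how the pipeline treats missing ETo values when summing is not stated here, so dropping them with `na.rm = TRUE` is an assumption.

```r
# Invented daily rainfall and reference evapotranspiration (mm)
rain <- c(0, 12, 0.5, 0, 0, 30, 2)
eto  <- c(4, 3, 4, 5, 5, 3, 4)

dry <- rain < 1                 # dry days at the 1 mm threshold
dry_runs <- rle(dry)            # consecutive wet/dry runs

rain_stats <- c(
  rain_sum          = sum(rain),
  eto_sum           = sum(eto, na.rm = TRUE),
  eto_na            = sum(is.na(eto)),
  w_balance         = sum(rain) - sum(eto, na.rm = TRUE),
  w_balance_negdays = sum(rain < eto, na.rm = TRUE),
  rain_l_1.days     = sum(dry),
  rain_l_1.days_pr  = mean(dry),
  rain_l_1.max_seq  = max(c(0, dry_runs$lengths[dry_runs$values]))
)
print(rain_stats)
```

The season total is positive (44.5 mm of rain against 28 mm of ETo) even though five of seven days are deficit days, which is why the daily indicators are reported alongside the seasonal balance.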
These variables describe evaporative ratio (Eratio) statistics, which serve as a proxy for water stress during the crop season. Eratio is computed as the ratio of actual evapotranspiration (Ea) to potential evapotranspiration (Ep), based on a daily water balance simulation that accounts for rainfall, PET, and soil water-holding capacity:
Eratio = Ea / Ep
Ep (potential evapotranspiration) is calculated using the Priestley–Taylor method.
Ea is estimated by simulating daily water availability in the soil, using a simple empirical model based on soil capacity and depletion (see calc_daily_watbal() in watbal_all_in_one.R).
Soil properties (e.g., field capacity, saturation, depth) are estimated using a pedotransfer function (AWCPTF()), and aggregated with soilcap_calc().
This approach integrates soil, rainfall, and climate to better reflect actual water supply to crops, beyond rainfall alone.
Low values indicate water deficits, while higher values suggest sufficient water supply relative to atmospheric demand.
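The sketch below is a deliberately simplified single-bucket water balance that shows how a daily Eratio can emerge from rainfall, potential evapotranspiration, and soil water capacity. It is not the calc_daily_watbal() implementation: the function `simulate_eratio()`, the linear scaling of Ea with relative soil water, the starting water content, and all input values are assumptions made for illustration only.

```r
# Simplified bucket model (hypothetical; not the ERA water balance code)
# rain, ep : daily rainfall and potential evapotranspiration (mm)
# capacity : plant-available soil water capacity (mm)
simulate_eratio <- function(rain, ep, capacity, start_frac = 0.5) {
  avail <- capacity * start_frac           # assumed initial soil water
  eratio <- numeric(length(rain))
  for (i in seq_along(rain)) {
    # Add rainfall; water beyond capacity is lost in this simple model
    avail <- min(capacity, avail + rain[i])
    # Assumed rule: Ea scales linearly with relative soil water content
    ea <- min(avail, ep[i] * avail / capacity)
    avail <- avail - ea
    eratio[i] <- if (ep[i] > 0) ea / ep[i] else NA_real_
  }
  eratio
}

# Five dry days: Eratio declines as the soil dries out
dry_spell <- simulate_eratio(rain = rep(0, 5), ep = rep(5, 5), capacity = 100)
print(round(dry_spell, 3))

# A large rainfall event refills the bucket, so Eratio returns to 1
wet_day <- simulate_eratio(rain = 200, ep = 5, capacity = 100)
print(wet_day)
```

Even in this toy version the key behavior is visible: Eratio depends on stored soil water as well as on the day's rainfall, so a dry day after heavy rain is not necessarily a stress day.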
1. Summary Eratio Statistics
eratio_mean: Mean daily Eratio over the observation window.
eratio_median: Median daily Eratio.
eratio_min: Minimum daily Eratio (most severe water deficit day).
2. Water Stress Indicators (eratio_l_*)
These fields capture frequency, duration, and intensity of low Eratio events, using thresholds of <0.5, <0.25, and <0.1.
For each threshold:
- eratio_l_[threshold].days: Number of days where Eratio fell below the threshold. e.g., eratio_l_0.5.days = number of days with Eratio < 0.5.
- eratio_l_[threshold].days_pr: Proportion of total days with Eratio below the threshold.
- eratio_l_[threshold].max_seq: Maximum consecutive sequence of days below the threshold.
- eratio_l_[threshold].n_seq_dX: Number of spells of at least X consecutive days below the threshold:
- d5 = ≥5 consecutive days
- d10 = ≥10 consecutive days
- d15 = ≥15 consecutive days
Thresholds represent escalating levels of water stress:
- 0.5: Mild deficit
- 0.25: Moderate deficit
- 0.1: Severe deficit
These metrics can be used to identify seasonal water stress risk, evaluate drought periods, and inform adaptive irrigation or planting strategies.
These variables summarize soil waterlogging conditions during the crop season.
Waterlogging is defined here as the amount of water held in the soil above field capacity but below saturation, simulated via a daily water balance using calc_daily_watbal() from watbal_all_in_one.R.
Logging occurs when incoming rainfall exceeds the soil’s capacity to retain water at field capacity, but has not yet exceeded total saturation.
1. Summary Waterlogging Statistics
logging_sum: Total cumulative logging value across the observation window.
logging_mean: Mean daily logging value.
logging_median: Median daily logging value.
logging_present_mean: Mean logging value on days when waterlogging was present (i.e., > 0).
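Following the definition above (water held above field capacity but below saturation), the summary fields can be illustrated from a daily soil water series. This is a sketch under stated assumptions: the helper `logging_from_storage()`, the units, and the input values are invented, and the real values come from the daily water balance simulation rather than from this shortcut.

```r
# Hypothetical helper: daily logging = soil water above field capacity,
# capped at the gap between saturation and field capacity (all in mm)
logging_from_storage <- function(soil_water, field_capacity, saturation) {
  pmin(pmax(soil_water - field_capacity, 0), saturation - field_capacity)
}

# Invented daily soil water contents (mm)
soil_water <- c(80, 105, 130, 160, 95)
logging <- logging_from_storage(soil_water, field_capacity = 100, saturation = 150)

logging_stats <- c(
  logging_sum          = sum(logging),
  logging_mean         = mean(logging),
  logging_median       = median(logging),
  logging_present_mean = mean(logging[logging > 0])
)
print(logging_stats)
```

The gap between `logging_mean` and `logging_present_mean` shows whether waterlogging was spread thinly across the season or concentrated in a few severe days.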
2. General Waterlogging Presence (logging_g_0.*)
These fields indicate periods when water balance > 0, a proxy for general waterlogging.
logging_g_0.days: Number of days where waterlogging > 0.
logging_g_0.days_pr: Proportion of days with waterlogging > 0.
logging_g_0.max_seq: Longest consecutive sequence of waterlogged days.
logging_g_0.n_seq_dX: Number of spells of at least X consecutive days with waterlogging (X = 5, 10, or 15).
These fields apply stricter thresholds based on soil saturation:
- ssat_0.5: Moderate saturation (50% of saturation)
- ssat_0.9: High saturation (90% of saturation)
For each threshold:
logging_g_ssat_[threshold].days: Number of days exceeding the saturation threshold.
logging_g_ssat_[threshold].days_pr: Proportion of season with saturation exceeded.
logging_g_ssat_[threshold].max_seq: Maximum consecutive days above threshold.
logging_g_ssat_[threshold].n_seq_dX: Number of long saturation spells:
d5: ≥5 consecutive days
d10: ≥10 consecutive days
d15: ≥15 consecutive days
These indicators help assess excess moisture risks, which can influence root health, germination success, and yields.
1.7 Connecting climate stats back to the ERA database
1.7.1 ERA Comparisons Table
# Set the remote S3 path and local save path
s3_data_dir <- "s3://digital-atlas/era/data"
local_data_dir <- "downloaded_data"

# List and filter files
s3 <- s3fs::S3FileSystem$new(anonymous = TRUE)
files_s3 <- s3$dir_ls(s3_data_dir)
files_s3 <- grep("compiled.*mh.*parquet", files_s3, value = TRUE)

# Filter to most recent version of dataset
(files_s3 <- tail(files_s3, 1))

# Create local file path and download
files_local <- gsub(s3_data_dir, local_data_dir, files_s3)
if (!file.exists(files_local)) {
  s3$file_download(files_s3, files_local)
}

# Load the data
era_comparisons <- arrow::read_parquet(files_local)
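library(data.table)
library(knitr)
library(kableExtra)

# Fields shared by the ERA comparisons table and the climate tables
key_fields <- c("Site.Key", "M.Year", "Product.Simple", "Plant.Start", "Plant.End", "Harvest.Start", "Harvest.End")

# The data.table syntax below needs a data.table; arrow::read_parquet() returns a data.frame
era_comparisons <- as.data.table(era_comparisons)

# Climate data to merge (copied so that clim_data is not modified by reference)
clim_mergedat <- copy(clim_data$PDate.SLen.EcoCrop$gdd)

# Rename the Product field to match the ERA comparisons table
setnames(clim_mergedat, "Product", "Product.Simple")

# Remove unneeded columns
clim_mergedat[, c("row_index", "window", "EU") := NULL]

# Remove any duplicates
clim_mergedat <- unique(clim_mergedat)

# Merge datasets
era_comparisons_gdd <- merge(era_comparisons, clim_mergedat, by = key_fields, all.x = TRUE, sort = FALSE)

# Explore merge result
gdd_cols <- grep("gdd", colnames(era_comparisons_gdd), value = TRUE)
head(era_comparisons_gdd[!is.na(gdd_subopt), c(key_fields, gdd_cols), with = FALSE]) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")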
Why is this less than half of the total data available?
1. A planting date or window must have been reported.
2. If planting uncertainty is too high, it may not have been possible to infer the planting date.
3. Sites with large spatial uncertainty (>50km radius) are excluded.
4. Climate statistics have not been calculated for animal experiments.
5. Climate statistics are not calculated for spatially aggregated sites, products or time periods.
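When repeating this merge, it can help to check two things: that the climate table has no duplicated keys (which would multiply rows in the result), and what share of ERA rows actually received climate values. The sketch below shows both checks with base R on invented toy tables, using a reduced set of key fields for brevity.

```r
# Toy stand-ins for the ERA comparisons table and a climate statistics table
era <- data.frame(
  Site.Key       = c("A", "A", "B", "C"),
  M.Year         = c("2010", "2011", "2010", "2012"),
  Product.Simple = c("Maize", "Maize", "Beans", "Maize"),
  yield          = c(2.1, 2.4, 0.9, 3.0)
)
clim <- data.frame(
  Site.Key       = c("A", "B"),
  M.Year         = c("2010", "2010"),
  Product.Simple = c("Maize", "Beans"),
  gdd_opt        = c(850, 620)
)

key <- c("Site.Key", "M.Year", "Product.Simple")

# Guard: duplicated keys on the climate side would duplicate ERA rows
stopifnot(anyDuplicated(clim[key]) == 0)

# Left join keeps every ERA row; unmatched rows get NA climate values
merged <- merge(era, clim, by = key, all.x = TRUE, sort = FALSE)

# Share of ERA rows that received climate statistics
coverage <- mean(!is.na(merged$gdd_opt))
print(coverage)
```

In this toy case half of the rows are matched. On the real tables, breaking coverage down by product or site can show which of the exclusion rules listed above accounts for most of the unmatched records.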
The annual and long-term average datasets are small, so we can simply download them from the ERA S3 bucket.
# Set the remote S3 path
s3_data_dir <- "s3://digital-atlas/era/geodata"

# List and filter files
s3 <- s3fs::S3FileSystem$new(anonymous = TRUE)
files_s3 <- s3$dir_ls(s3_data_dir)
file_ltavg <- grep("chirps_ltavg.*parquet", files_s3, value = TRUE)
file_annual <- grep("chirps_annual.*parquet", files_s3, value = TRUE)

# Filter to most recent version of each dataset
(file_ltavg <- tail(file_ltavg, 1))
(file_annual <- tail(file_annual, 1))

files_s3 <- c(file_ltavg, file_annual)

# Create local file paths and download
files_local <- gsub(s3_data_dir, local_data_dir, files_s3)
for (i in seq_along(files_local)) {
  if (!file.exists(files_local[i])) {
    s3$file_download(files_s3[i], files_local[i])
  }
}

# Load ltavg and annual data
chirps_ltavg <- arrow::read_parquet(files_local[1])
chirps_annual <- arrow::read_parquet(files_local[2])
The daily CHIRPS dataset is quite large, so let's use the arrow package to download only the head of the data. To learn more about using the arrow package to access parquet data in R see https://arrow.apache.org/docs/r/.
In future ERA updates we will optimize the partition structure of the parquet tables to facilitate faster access. In the short term, we suggest that working locally with the files is still the best option.
# Load head of daily data only
files_s3 <- s3$dir_ls(s3_data_dir)
file_daily <- grep("chirps.*parquet", files_s3, value = TRUE)
file_daily <- file_daily[!grepl("ltavg|annual", file_daily)]
(file_daily <- tail(file_daily, 1))

files_local <- gsub(s3_data_dir, local_data_dir, file_daily)
if (!file.exists(files_local)) {
  chirps_daily <- arrow::open_dataset(file_daily, format = "parquet", filesystem = s3)
  # Read the first 5 rows into a data.table
  chirps_daily <- data.table::as.data.table(head(chirps_daily, 5))
  # Save result
  arrow::write_parquet(chirps_daily, files_local)
} else {
  chirps_daily <- arrow::read_parquet(files_local)
}
The annual and long-term average datasets are small, so we can simply download them from the ERA S3 bucket.
# Set the remote S3 path
s3_data_dir <- "s3://digital-atlas/era/geodata"

# List and filter files
s3 <- s3fs::S3FileSystem$new(anonymous = TRUE)
files_s3 <- s3$dir_ls(s3_data_dir)
file_ltavg <- grep("POWER_ltavg.*parquet", files_s3, value = TRUE)
file_annual <- grep("POWER_annual.*parquet", files_s3, value = TRUE)

# Filter to most recent version of dataset
(file_ltavg <- tail(file_ltavg, 1))
files_local <- gsub(s3_data_dir, local_data_dir, file_daily)
if (!file.exists(files_local)) {
  s3$file_download(file_daily, files_local)
  # Subset to the first 5 rows (head() is used because [1:5] on a data.frame selects columns)
  power_daily <- head(arrow::read_parquet(files_local), 5)
  # Save result
  arrow::write_parquet(power_daily, files_local)
} else {
  power_daily <- arrow::read_parquet(files_local)
}
# Set the remote S3 path
s3_data_dir <- "s3://digital-atlas/era/geodata"

# List and filter files
s3 <- s3fs::S3FileSystem$new(anonymous = TRUE)
files_s3 <- s3$dir_ls(s3_data_dir)
files_s3 <- files_s3[!grepl("watbal", files_s3)]
file_soilgrids <- grep("soilgrids2.0.*parquet", files_s3, value = TRUE)
file_isda <- grep("isda.*parquet", files_s3, value = TRUE)

# Filter to most recent version of each dataset
(file_soilgrids <- tail(file_soilgrids, 1))
(file_isda <- tail(file_isda, 1))

# soilgrids first, isda second, matching the load order below
files_s3 <- c(file_soilgrids, file_isda)

# Create local file paths and download
files_local <- gsub(s3_data_dir, local_data_dir, files_s3)
for (i in seq_along(files_local)) {
  if (!file.exists(files_local[i])) {
    s3$file_download(files_s3[i], files_local[i])
  }
}

# Load data
# Note the soilgrids data is quite large (>150 MB) so it will take a few minutes to download
soilgrids <- arrow::read_parquet(files_local[1])
isda <- arrow::read_parquet(files_local[2])
# Replace s3 path with https path so we can read directly into R
http_path <- gsub("^s3://([^/]+)/(.*)$", "https://\\1.s3.amazonaws.com/\\2", files_s3)
soilgrids_metadata <- data.table::fread(http_path)
files_s3 <- grep("isda.*metadata", files_s3, value = TRUE)

# Replace s3 path with https path so we can read directly into R
http_path <- gsub("^s3://([^/]+)/(.*)$", "https://\\1.s3.amazonaws.com/\\2", files_s3)
isda_metadata <- data.table::fread(http_path)
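The S3-to-HTTPS substitution used for the metadata files can be wrapped in a small helper so the pattern is written only once. In the sketch below, `s3_to_https()` is a hypothetical name and the file name in the example is invented; the regular expression is the one used in this guide. The length check is there because a reader such as fread() expects a single path.

```r
# Hypothetical helper: convert "s3://bucket/key" to the public HTTPS form
s3_to_https <- function(s3_path) {
  stopifnot(length(s3_path) == 1)  # one file at a time
  gsub("^s3://([^/]+)/(.*)$", "https://\\1.s3.amazonaws.com/\\2", s3_path)
}

# Invented file name for illustration
url <- s3_to_https("s3://digital-atlas/era/geodata/example_metadata.csv")
print(url)
```

The first capture group is the bucket name and the second is the object key, so the same helper works for any publicly readable object in the bucket.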
files_local <- gsub(s3_data_dir, local_data_dir, files_s3)

# File size is about 40 MB, so download will take some time depending on your connection
if (!file.exists(files_local)) {
  s3$file_download(files_s3, files_local)
}
[1] "downloaded_data/sos_2025-04-13.RData"
# Load sos data
sos <- miceadds::load.Rdata2(file = basename(files_local), path = dirname(files_local))
names(sos)
The start-of-season (SOS) script, /R/add_geodata/calculate_sos.R, processes raw climate data to derive growing-season indicators at multiple temporal scales. It combines dekadal (10-day) climate records with monthly and seasonal aggregations to compute metrics such as the start of season (SOS), end of season (EOS), length of growing period (LGP), and total seasonal rainfall. The outputs are intended to support agricultural planning and climate adaptation analysis.
We do not describe every field in the sos tables here; full descriptions are in the metadata table. The climate statistic calculations use sos$LTAvg_SOS2, which gives the long-term average onset of the rainy season (in dekads) for each location.
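Because LTAvg_SOS2 reports onset in dekads, it can help to translate a dekad index into a calendar date. The sketch below is illustrative only: dekad_to_date() is a hypothetical helper, not part of the ERA codebase, and it assumes dekads are numbered 1 to 36 within a calendar year (three per month, starting on days 1, 11, and 21). Check the sos metadata before relying on that numbering.

```r
# Hypothetical helper (not an ERA function): convert a dekad-of-year index
# (assumed to run from 1 to 36) to the first calendar date of that dekad.
dekad_to_date <- function(dekad, year) {
  stopifnot(all(dekad >= 1 & dekad <= 36))
  month <- (dekad - 1) %/% 3 + 1               # month number, 1 to 12
  day <- c(1, 11, 21)[(dekad - 1) %% 3 + 1]    # dekads start on day 1, 11 or 21
  as.Date(sprintf("%04d-%02d-%02d",
                  as.integer(year), as.integer(month), as.integer(day)))
}

# Example: first, tenth and last dekad of 2020
onset_dates <- dekad_to_date(c(1, 10, 36), 2020)
onset_dates
```

A long-term mean onset may be fractional, so it would need rounding (for example with round()) before being passed to a helper like this.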
metadata
files_s3 <- s3$dir_ls(s3_data_dir)
files_s3 <- grep("sos.*metadata.*csv", files_s3, value = TRUE)
(files_s3 <- tail(files_s3, 1))
# Replace s3 path with https path so we can read directly into R
http_path <- gsub("^s3://([^/]+)/(.*)$", "https://\\1.s3.amazonaws.com/\\2", files_s3)
sos_metadata <- fread(http_path)
Dekadal_SOS
Key Metrics: Rainfall, potential evapotranspiration (ETo), aridity index (AI), and computed dekad values.
Purpose: Provides the high-resolution temporal detail needed to identify seasonal transitions and establish baseline climate conditions.
Seasonal_SOS2
Description: Aggregates dekadal data into a seasonal view focused on the primary growing season.
Key Metrics:
SOS: The first dekad when conditions indicate the start of the season.
EOS: The last dekad of the season.
LGP: The length of the growing period (calculated as the difference between EOS and SOS).
Tot.Rain: Total rainfall during the season.
Methodology: Uses rolling sums and fixed threshold criteria (e.g., minimum rainfall, aridity conditions) to define the rainy period, with padding applied to manage edge effects.
LTAvg_SOS2
Description: Provides long-term average statistics derived from the primary seasonal data.
Key Metrics:
Mean, median, minimum, and maximum SOS values.
Average EOS, LGP, and average total seasonal rainfall (mean seasonal precipitation, MSP).
Proportions indicating seasonal transitions across calendar years.
Purpose: Summarizes seasonal behavior over the full record, highlighting variability and central tendencies.
Seasonal_SOS3 and LTAvg_SOS3
Description: These datasets mirror Seasonal_SOS2 and LTAvg_SOS2 but pertain to an additional (often secondary) growing season, which may be present in more humid regions.
Key Metrics & Purpose: Similar to the primary season outputs, these capture SOS, EOS, LGP, and rainfall for the secondary season, adding nuance to regions where multiple growing periods exist.
Data Integration: The script merges datasets from the POWER and CHIRPS sources—substituting CHIRPS rainfall into the POWER dataset—to leverage the strengths of both and ensure more accurate rainfall data.
Temporal Aggregation: Daily data are first converted to dekadal values. These are further aggregated into monthly summaries and then into seasonal periods using custom functions (e.g., SOS_Dekad, SOS_SeasonPad).
Threshold-Based Filtering: Fixed criteria (e.g., a minimum rainfall threshold of 200 mm, an aridity index cutoff) are applied to delineate season boundaries. While these thresholds are clearly defined, they prompt a critical question: are they universally applicable, or do they require recalibration for different regions and evolving climate conditions?
Handling Season Transitions: The script manages scenarios where seasons cross calendar boundaries, excludes incomplete years, and applies specific padding rules to balance season lengths. Custom sequence functions (e.g., SOS_UniqueSeq, SOS_SeqMerge) play a key role in ensuring the integrity of the seasonal identification.
Fixed Parameters vs. Regional Flexibility: The use of fixed thresholds (for rainfall and aridity) is straightforward but invites scrutiny. It is important to ask whether these parameters are optimal for all regions, especially in a changing climate. Future iterations might consider adaptive or region-specific thresholds.
Modular and Adaptable Structure: The script’s modular design—with separate outputs for dekadal details, seasonal summaries, and long-term averages—allows for flexibility. This structure facilitates updates and refinements, such as integrating more dynamic statistical methods or machine learning approaches to adjust thresholds.
Robustness in Data Quality: Steps to exclude incomplete data and adjust for season transitions add robustness, but constant validation against observed ground conditions is critical for long-term reliability.
The output datasets provide a comprehensive picture of growing season dynamics:
Dekadal_SOS offers fine-scale temporal resolution.
Seasonal_SOS2 and LTAvg_SOS2 capture the characteristics of the primary and secondary growing seasons.
Seasonal_SOS3 and LTAvg_SOS3 extend this analysis to a potential third season.
This structure supports detailed analysis and decision-making in agricultural and climate adaptation planning. However, while the methodology is thorough, questioning the fixed thresholds and continuously validating the approach against real-world data remains essential for maintaining relevance in a forward-thinking, dynamic climate context.
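To make the aggregation and threshold steps concrete, here is a minimal base R sketch of the general idea, not the calculate_sos.R implementation. It uses invented rainfall values, assumes dekads are numbered 1 to 36 within the year, and applies a hypothetical 20 mm per dekad "wet" rule. The real script relies on rolling sums, an aridity criterion, and padding, none of which are reproduced here.

```r
# Toy daily rainfall for one site and year (invented values, not ERA data):
# dry until 10 April, 8 mm/day from 11 April to 30 September, dry afterwards.
daily <- data.frame(
  date = seq(as.Date("2020-01-01"), as.Date("2020-12-31"), by = "day")
)
daily$rain <- ifelse(daily$date >= as.Date("2020-04-11") &
                       daily$date <= as.Date("2020-09-30"), 8, 0)

# Dekad of year (assumed 1 to 36): days 1-10, 11-20 and 21-end of each month
month <- as.integer(format(daily$date, "%m"))
day <- as.integer(format(daily$date, "%d"))
daily$dekad <- (month - 1) * 3 + pmin((day - 1) %/% 10, 2) + 1

# Aggregate daily rainfall to dekadal totals
dekadal <- aggregate(rain ~ dekad, data = daily, FUN = sum)

# Hypothetical rule: a dekad counts as "wet" when it receives at least 20 mm
wet_threshold <- 20
wet <- dekadal$dekad[dekadal$rain >= wet_threshold]

# Season metrics, with LGP taken as EOS minus SOS as described above
in_season <- dekadal$dekad >= min(wet) & dekadal$dekad <= max(wet)
season <- data.frame(
  SOS = min(wet),
  EOS = max(wet),
  LGP = max(wet) - min(wet),
  Tot.Rain = sum(dekadal$rain[in_season])
)
season
```

A single fixed threshold is used here only to keep the example short; seasons that cross the calendar year boundary would need the extra handling described above.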
1.8.6 AEZ
1.8.6.1 Access
files_s3 <- s3$dir_ls(s3_data_dir)
files_s3 <- grep("aez_.*parquet", files_s3, value = TRUE)
(files_s3 <- tail(files_s3, 1))
Latitude: Geographic latitude of the site (decimal degrees, WGS84).
Longitude: Geographic longitude of the site (decimal degrees, WGS84).
Site.Key: Unique site identifier used throughout ERA.
Buffer: Radius (in meters) used to extract AEZ values from raster data.
prop: Proportion of the buffer area covered by the dominant AEZ category.
value: Numeric AEZ class code assigned by the source dataset.
dataset: The AEZ raster dataset used, e.g., "004_afr-aez_09.tif" or "AEZ8_CLAS--SSA.tif".
value_cat: Human-readable label for the AEZ zone, derived from the class value using an external key or metadata file (e.g., "Tropic - cool / humid").
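As a rough illustration of how these fields might be joined to ERA observations, the sketch below merges a toy AEZ table onto a toy observation table by Site.Key. All values are invented, era_obs and obs_id are hypothetical stand-ins for an ERA observation table, and the choice to keep a single raster dataset is an assumption made so that each site contributes one row.

```r
# Toy AEZ table using the fields listed above (all values are invented)
aez <- data.frame(
  Site.Key = c("site_A", "site_A", "site_B"),
  dataset = c("004_afr-aez_09.tif", "AEZ8_CLAS--SSA.tif", "004_afr-aez_09.tif"),
  value = c(1, 7, 2),
  value_cat = c("Tropic - cool / humid", "placeholder class", "placeholder class"),
  prop = c(0.92, 0.80, 1.00)
)

# Hypothetical ERA observations; only Site.Key is assumed to match the real data
era_obs <- data.frame(
  obs_id = 1:3,
  Site.Key = c("site_A", "site_B", "site_C")
)

# Keep one AEZ raster so each site contributes at most one row
aez_one <- aez[aez$dataset == "004_afr-aez_09.tif",
               c("Site.Key", "value_cat", "prop")]

# Left join: observations without an AEZ match are kept with NA
era_aez <- merge(era_obs, aez_one, by = "Site.Key", all.x = TRUE)
era_aez
```

If the real table holds several Buffer values per site, the same kind of filter would be needed on Buffer before merging to avoid duplicated observations.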
1.8.7 Soil Moisture
Daily soil moisture balance is calculated using a simple water balance model implemented in water_balance.R.
This model simulates daily soil water availability, evaporative demand, and waterlogging risk for each site, based on rainfall (CHIRPS), temperature and radiation (NASA POWER), and soil properties derived from ISDA (Africa) or SoilGrids 2.0 (non-Africa). These daily values are the foundation for seasonal summaries of water stress (ERATIO) and excess moisture (LOGGING).
ERATIO (Evaporative Ratio) is the ratio of actual evapotranspiration (Ea) to potential evapotranspiration (Ep).
Values near 1 indicate sufficient water availability—plants are able to meet atmospheric demand.
Values < 0.5 suggest moderate to severe water stress, where crop water needs are not being met.
Daily ERATIO values are used to summarize frequency, duration, and intensity of drought conditions across a season.
LOGGING represents the amount of water in the soil above field capacity but below saturation.
Positive values indicate periods where excess water may restrict oxygen availability to roots (i.e., waterlogging).
Used to flag moisture stress due to excess rainfall or poor drainage.
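The frequency, duration, and intensity summaries mentioned above could be derived along these lines. This is a sketch with invented daily values, and the specific summary statistics (stress-day count, longest consecutive run, mean ERATIO) are illustrative choices rather than the statistics the ERA pipeline necessarily reports. The 0.5 cutoff is the one given in the text.

```r
# Toy daily series for one site and season (invented values)
eratio <- c(0.9, 0.8, 0.4, 0.3, 0.45, 0.7, 0.2, 0.95)
logging <- c(0, 0, 0, 0, 0, 5, 0, 12)

# Flag water-stressed days using the 0.5 cutoff described above
stress <- eratio < 0.5

# Run-length encoding gives the length of each consecutive stressed spell
runs <- rle(stress)

summary_stats <- list(
  stress_days = sum(stress),                                  # frequency
  longest_stress_run = max(c(0, runs$lengths[runs$values])),  # duration
  mean_eratio = mean(eratio),                                 # intensity
  logging_days = sum(logging > 0),                            # days with excess water
  logging_total_mm = sum(logging)
)
summary_stats
```

Including 0 inside max() keeps the duration well defined for a season with no stressed days.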
1.8.7.1 Access
files_s3 <- s3$dir_ls(s3_data_dir)
# Substitute isda for soilgrids to access the isda-based data (for African sites)
# Here we are using soilgrids2.0 because the file size is more convenient for this vignette
files_s3 <- grep("watbal.*soilgrids.*parquet", files_s3, value = TRUE)
(files_s3 <- tail(files_s3, 1))
files_local <- gsub(s3_data_dir, local_data_dir, files_s3)
# File size is about x MB, so download will take some time depending on your connection
if (!file.exists(files_local)) {
  s3$file_download(files_s3, files_local)
}
Each row represents a unique site-day combination.
Field Descriptions:
Site.Key: Unique identifier for the ERA site.
scp: Soil water holding capacity at field capacity (mm). Estimated from ISDA or SoilGrids based on pedotransfer rules.
ssat: Soil saturation point (mm). Maximum amount of water the soil can hold.
DATE: Observation date (daily time step).
TMIN: Minimum daily air temperature (°C), from NASA POWER.
TMAX: Maximum daily air temperature (°C), from NASA POWER.
TMEAN: Mean daily air temperature (°C).
RAIN: Daily precipitation (mm), from CHIRPS.
SRAD: Surface solar radiation (MJ/m²/day), from NASA POWER.
ETMAX: Potential evapotranspiration (PET, mm/day), calculated using the Priestley–Taylor method.
AVAIL: Estimated soil water available to crops (mm). Simulated daily from soil and rainfall inputs.
DEMAND: Crop water demand (mm). Equal to ETMAX if water is not limiting.
ERATIO: Evaporative ratio (Ea/Ep) — actual evapotranspiration divided by PET. A proxy for crop water stress.
LOGGING: Simulated waterlogging value (mm above field capacity, but below saturation).
RUNOFF: Excess rainfall (mm) beyond soil saturation capacity; assumed to be lost as runoff or deep percolation.
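Since each row is a site-day, seasonal statistics require restricting the table to each site's growing window first. The sketch below shows one way this might be done in base R with invented values. The seasons table and its plant and harvest columns are hypothetical, standing in for the planting and harvest dates attached to ERA observations.

```r
# Toy site-day water balance table using a subset of the fields above
watbal <- data.frame(
  Site.Key = rep(c("site_A", "site_B"), each = 6),
  DATE = rep(seq(as.Date("2020-05-01"), by = "day", length.out = 6), times = 2),
  RAIN = c(0, 12, 0, 0, 30, 5, 0, 0, 0, 2, 0, 0),
  ERATIO = c(1.0, 0.9, 0.6, 0.4, 0.8, 1.0, 0.6, 0.45, 0.3, 0.35, 0.2, 0.1),
  LOGGING = c(0, 3, 0, 0, 10, 2, 0, 0, 0, 0, 0, 0)
)

# Hypothetical growing windows, one per site
seasons <- data.frame(
  Site.Key = c("site_A", "site_B"),
  plant = as.Date(c("2020-05-02", "2020-05-01")),
  harvest = as.Date(c("2020-05-05", "2020-05-04"))
)

# Attach each site's window, then keep only days inside it
x <- merge(watbal, seasons, by = "Site.Key")
x <- x[x$DATE >= x$plant & x$DATE <= x$harvest, ]

# Stress flag (ERATIO below 0.5), summed alongside rainfall and waterlogging
x$stress_days <- as.integer(x$ERATIO < 0.5)
season_summary <- aggregate(cbind(RAIN, LOGGING, stress_days) ~ Site.Key,
                            data = x, FUN = sum)
season_summary
```

For the full dataset a data.table or arrow-based join would likely be faster; base R is used here only to keep the example dependency-free.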
1.8.8 Bioclim
This will be made available in future updates. If you have a critical need for this information then please contact the ERA team and we can prioritize these data.
2 Acknowledgements
These open-source scripts were delivered for and funded by the Agroecology in the Dry Corridor of Central America (ACDC) project.
3 Contact Us
For more details or to explore collaborative opportunities: