Climate & Soils Data in ERA

Author

Alliance of Bioversity International & CIAT

Published

July 7, 2025

1 How ERA Connects to Geospatial Climate and Soils Data

Intended Users

This documentation is intended for technical users working with the ERA meta-dataset who wish to integrate seasonally relevant climate statistics into agronomic observations. Users do not need to rerun the calculations—preprocessed climate data are provided on S3—but may use this guide to understand:

What climate indicators were generated
How planting dates and season lengths were determined
Where to find the data and how to merge them with ERA observations
Where to find the code used to generate and process data

Background

We developed a geospatial enrichment pipeline to augment ERA’s agronomic experiments with high-resolution climate, soil, and elevation data, linked to specific crops, locations, and growing seasons. Each observation is connected to daily weather time series and soil attributes based on its site coordinates and reported or inferred planting and harvest dates. Where precise dates are unavailable, the pipeline uses a tiered imputation approach—drawing on published planting windows, nearby analogs, and agroclimatic indicators such as rainfall onset—to estimate a plausible growing season. This enables the calculation of detailed climate statistics for the period most relevant to crop development, while excluding records with excessive spatial or temporal uncertainty.

The enrichment process applies only to crop-based experiments. Climate statistics are generated only where both spatial and temporal resolution meet defined quality thresholds—specifically, where the site location is known within 50 km and the cropping calendar can be clearly determined. Records from animal feed experiments, as well as spatially or temporally aggregated data (e.g., regional summaries or multi-year averages), are not included. As a result, only a subset of ERA observations receive climate enrichment—those with sufficient detail to anchor the analysis in a specific place and season.

1.1 Data Sources

The ERA pipeline enriches observations with climate, soil, and landscape data using custom functions stored in:

R/add_geodata/: main dataset scripts
R/add_geodata/functions/: core download and utility functions

Below are the datasets used and the script locations.

1.1.1 CHIRPS (Rainfall)

Dataset: CHIRPS Daily Rainfall
Resolution: 0.05° (~5.5 km)
Coverage: Quasi-global (50°S–50°N, which includes Africa), 1981–present
Download Source: https://data.chc.ucsb.edu/products/CHIRPS-2.0/global_daily/tifs/p05/
Download Script: R/add_geodata/chirps.R
Download Function: R/add_geodata/functions/download_chirps.R
Notes: Filenames include the date (e.g. chirps.2023.01.01.tif.gz). Versioning is inferred from file date.

1.1.2 POWER (NASA)

Dataset: NASA POWER (Temperature, Radiation, Wind, etc.)
Resolution: 0.5° lat × 0.625° lon
Coverage: global, ~1983–present
Download Source: NASA POWER API — https://power.larc.nasa.gov/api/temporal/daily/
Script: R/add_geodata/power.R
Download Function: R/add_geodata/functions/download_power.R

1.2.1 Soil Data Sources

Soil data are used to estimate key properties like water-holding capacity, which underpin the calculation of climate indicators such as Eratio and waterlogging. Two soil datasets are used depending on site location:

1.2.1.1 iSDAsoil (Africa only)

Dataset: iSDAsoil
Resolution: 30 m
Coverage: Sub-Saharan Africa
Source: https://www.isda-africa.com/isdasoil
Use: Used for all African sites in ERA. Offers high-resolution predictions of soil texture, carbon, pH, and depth—well-suited to the diversity of African agroecosystems.
Download Scripts: soilgrids.R

1.2.1.2 SoilGrids 2.0 (Non-Africa)

Dataset: SoilGrids 2.0 (ISRIC)
Resolution: 250 m
Coverage: Global
Use: Applied to non-African sites in ERA.
Download Scripts: soilgrids2.R.
Functions: download_soilgrids2.R
Notes: Accesses data via soilDB::fetchSoilGrids() and reshapes raster. Outputs CSV summaries per site and variable.

Note: We plan to extend SoilGrids 2.0 to African sites in a future update, allowing for harmonized coverage across all regions.

1.3.1 AEZ (Agro-Ecological Zones)

Layers Used:
- AEZ16_CLAS--SSA.tif: from Harvard Dataverse
- 004_afr-aez_09.tif: from ISRIC server
Script: R/add_geodata/aez.R.
Notes: The ISRIC AEZ layer is recoded with value-to-label mappings from a CSV.

1.3.2 Elevation (DEM)

Dataset: Elevation raster
Download Script: R/add_geodata/elevation.R.
Notes: Processed from SRTM or other public sources.

1.3.3 Water Balance & Onset of Rain

Water Balance: R/add_geodata/water_balance.R
Onset Date (Start of Season): R/add_geodata/calculate_sos.R
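
As a rough illustration of a dekadal rainfall onset rule, the sketch below applies the thresholds this document gives for the SOS field (at least 25 mm in a dekad, at least 20 mm in the following two dekads, aridity index of at least 0.5). It is not the code in calculate_sos.R. The helper name find_onset_dekad, the toy inputs, and the reading of "following two dekads" as a combined total are all assumptions.

```r
# Hypothetical helper (not part of the ERA pipeline): index of the first
# dekad that satisfies a simple rainfall onset rule.
# rain_dekad: dekadal rainfall totals (mm)
# ai:         aridity index per dekad (assumed to be precomputed)
find_onset_dekad <- function(rain_dekad, ai, wet = 25, follow = 20, ai_min = 0.5) {
  n <- length(rain_dekad)
  if (n < 3) return(NA_integer_)
  for (i in seq_len(n - 2)) {
    # Assumption: the two following dekads are judged on their combined total
    next_two <- rain_dekad[i + 1] + rain_dekad[i + 2]
    ok <- rain_dekad[i] >= wet && next_two >= follow && ai[i] >= ai_min
    if (isTRUE(ok)) return(i)
  }
  NA_integer_  # no dekad met the rule
}

# Toy data: dekad 3 is a false start (wet, but followed by a dry spell)
rain <- c(2, 8, 30, 4, 3, 27, 12, 15, 40)
ai   <- c(0.1, 0.2, 0.6, 0.3, 0.3, 0.7, 0.8, 0.9, 1.0)
onset <- find_onset_dekad(rain, ai)
print(onset)
```

Requiring follow-up rainfall is what lets a rule of this kind reject isolated early storms that are followed by a dry spell.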

1.4 Methods

The generate_climate_stats.R pipeline constructs crop-specific seasonal windows and computes derived climate indicators for each observation in the ERA agronomy dataset. These indicators are designed to reflect climate conditions experienced during the growing season, rather than general climatological conditions.

Each observation is linked to a custom seasonal window based on:

Reported Planting and Harvest Dates: If available, these dates define the crop’s growing period directly.
Imputed Dates: Where planting or harvest dates are missing or uncertain, the pipeline estimates plausible values using:
- Nearby observations (within 1–10 km)
- Published planting calendars
- Agroclimatic thresholds (e.g. start of rainy season based on dekadal CHIRPS rainfall)
Season Length Estimation: Season length is either taken from the original dataset, imputed from nearby records, or inferred from EcoCrop definitions of crop cycle duration.
Alternate Windows: In addition to the main growing period, alternate windows are used for specific purposes:
- PDate.SLen.EcoCrop: uses EcoCrop-inferred season length
- PDate.SLen.P30: fixed 30-day window after planting (used to assess early-season climate stress)

These windows allow climate statistics to be calculated only for periods relevant to crop development, improving interpretation compared to annual or calendar-based averages.
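
The three window types can be sketched as simple date arithmetic. This is a minimal illustration and not the pipeline code. The toy values are made up apart from the first planting date and season length, and whether the pipeline treats the end date as inclusive is an assumption not confirmed here.

```r
# Toy site table; P.Date.Merge is shown as a Date for readability
site <- data.frame(
  P.Date.Merge      = as.Date(c("2010-03-22", "2002-03-08")),
  SeasonLength.Data = c(140, NA),   # NA: no reported or inferred season length
  SLen.EcoCrop      = c(135, 135)
)

# Window start is the merged planting date in all three cases
site$start <- site$P.Date.Merge

# PDate.SLen.Data: planting date + data-derived season length (NA if unavailable)
site$end_data <- site$start + site$SeasonLength.Data

# PDate.SLen.EcoCrop: planting date + EcoCrop-derived season length
site$end_ecocrop <- site$start + site$SLen.EcoCrop

# PDate.SLen.P30: fixed 30-day window after planting
site$end_p30 <- site$start + 30

print(site)
```

The second row shows why the EcoCrop window covers more records: it still gets an end date when no data-derived season length exists.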

Climate Statistics Generated

Unlike the foundational datasets (e.g., daily rainfall, temperature, radiation), which provide raw gridded values, this pipeline produces seasonally aggregated statistics aligned with cropping windows. These include:

Temperature: Mean, max, min, and variability of daily temperatures; heat stress indicators (e.g., number of days >35°C).
Rainfall: Total rainfall, dry spell frequency, rainfall adequacy.
Growing Degree Days (GDD): Thermal accumulation across sub-optimal, optimal, and heat-stressed temperature bands.
Evaporative Ratio (ERatio): Daily ratio of actual to potential evapotranspiration — a proxy for drought stress.
Waterlogging (Logging): Estimated soil moisture excess above field capacity, indicating excess moisture risk.
Dry Spells: Frequency, length, and timing of low-rainfall periods.

Each of these indicators is calculated per site–season–crop combination using the daily CHIRPS and POWER datasets and simulated water balance (see water_balance.R).

These derived indicators provide a biophysically relevant summary of climate exposure tailored to the actual growing period of each crop, making them more actionable than raw daily data or long-term averages.

1.5.1 Downloading the climate data

To access the climate statistics generated for ERA observations, download the most recent harmonized .RData file from the geodata directory on S3:

S3 location: s3://digital-atlas/era/geodata/ (files are named clim_stats_<date>.RData, e.g. clim_stats_2025-03-18.RData; the code below selects the last file in the listing)
Content: This file contains daily and seasonal climate summaries per site, ready to be joined with ERA observations.

You can download the file using the s3fs interface as follows:

# Set the remote S3 path and local save path
s3_data_dir <- "s3://digital-atlas/era/geodata"
local_data_dir <- "downloaded_data"

# List and filter files
s3 <- s3fs::S3FileSystem$new(anonymous = TRUE)
files_s3 <- s3$dir_ls(s3_data_dir)
files_s3 <- grep("clim_stats.*RData", files_s3, value = TRUE)
(files_s3 <- tail(files_s3, 1))

[1] "s3://digital-atlas/era/geodata/clim_stats_2025-04-14.1.RData"

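# Create local file path and download (skipped if the file already exists)
files_local <- gsub(s3_data_dir, local_data_dir, files_s3, fixed = TRUE)
if (!file.exists(files_local)) {
  dir.create(dirname(files_local), recursive = TRUE, showWarnings = FALSE)
  s3$file_download(files_s3, files_local)
}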

Once downloaded, load the .RData file using:

# Load the harmonized climate data into your environment
clim_data <- miceadds::load.Rdata2(file = basename(files_local), path = dirname(files_local))

1.6 Climate data content and structure

clim_data is a named list of data tables, created by generate_climate_stats.R.

names(clim_data)

[1] "PDate.SLen.Data"    "PDate.SLen.EcoCrop" "PDate.SLen.P30"    
[4] "site_data"

site_data: contains the spatial and temporal location data for which climate statistics are generated.

PDate.SLen.Data, PDate.SLen.EcoCrop, PDate.SLen.P30: these objects are lists of output climate data calculated for different parameterizations of season length.

1.6.1 Unique locations and times (`clim_data$site_data`)

site_data contains the unique combinations of site, time, crop, planting date, and harvest date from the ERA agronomy dataset.

1.6.1.1 Site, year, season, & study

head(unique(clim_data$site_data[,.(Site.Key,Code,M.Year,Latitude,Longitude,M.Year.Code,M.Season)]))|>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Site.Key	Code	M.Year	Latitude	Longitude	M.Year.Code	M.Season
-0.0023 34.5939 B300	NN0381	2010	-0.00230	34.59390	NA	NA
-0.0023 34.5939 B300	LM0251	1987	-0.00230	34.59390	NA	NA
-0.0108 36.9617 B250	LM0235	2002.1	-0.01083	36.96167	1	1
-0.0420 34.5920 B12500	LM0267	1990.2	-0.04200	34.59200	NA	NA
-0.0420 34.5920 B12500	LM0267	1991.1	-0.04200	34.59200	NA	NA
-0.0420 34.5920 B12500	LM0267	1991.2	-0.04200	34.59200	NA	NA

Field Descriptions:

Site.Key: A unique identifier for each site or location. It is used to link locations consistently across datasets.
Code: A unique code used to identify a publication or entry in the ERA dataset. It serves as the main key for tracking a specific experiment/publication across associated tables.
M.Year: Measurement year – a code that identifies the production season, typically aligned with the Time field in the main ERA dataset. This may take the form of a calendar year or include other formatting to distinguish multiple seasons per year.
Latitude: Geographic latitude of the site in decimal degrees (WGS84). Used for spatial analyses and mapping.
Longitude: Geographic longitude of the site in decimal degrees (WGS84). Used for spatial analyses and mapping.
M.Year.Code: A standardized or formatted version of M.Year, often combining year and season. Useful for indexing and subsetting.
M.Season: Management season (typically 1 or 2) indicating the cropping season within a year. May be NA in unimodal systems; helps distinguish multiple cropping events in bimodal climates.

1.6.1.2 Crops

These fields contain thresholds that define a crop’s temperature response curve and come from EcoCrop. They can also be used to calculate growing degree days, stress indices, or suitability zones under historical or future climate conditions.

head(unique(clim_data$site_data[,.(Product,EU,Topt.low,Topt.high,Tlow,Thigh)]))|>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Product	EU	Topt.low	Topt.high	Tlow	Thigh
Maize	c7	18	33	10	47
Common Bean	h14	16	25	7	32
Sorghum	c13	22	35	8	40
Potato	i7	15	25	7	30
Soybean	h13	20	33	10	38
Tomato (Total Yield)	e27.1	20	27	7	35

Field Descriptions:

Product: The name of the crop or agricultural product (e.g., maize, beans) associated with the management and outcome data.
EU: Experimental Unit code, which links to the era_master_codes$EU table.
Tlow: The minimum temperature threshold for crop development. Below this value, crop growth is assumed to be negligible or halted. Often derived from EcoCrop or agronomic sources.
Thigh: The maximum temperature threshold for crop development. Temperatures above this can lead to heat stress or failure in development.
Topt.low: The lower bound of the optimal temperature range for crop growth. Within this and Topt.high, the crop achieves near-optimal physiological performance.
Topt.high: The upper bound of the optimal temperature range for crop growth. Growth efficiency typically declines beyond this value, even if not fully stressed.


1.6.1.3 Planting dates

site_data contains information about planting dates and their estimation:

head(clim_data$site_data[,.(Plant.Start,Plant.End,Plant.Diff.Raw,Data.PS.Date,Data.PE.Date,SOS,P.Date.Merge,P.Date.Merge.Source)])|>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Plant.Start	Plant.End	Plant.Diff.Raw	Data.PS.Date	Data.PE.Date	SOS	P.Date.Merge	P.Date.Merge.Source
2010-03-22	2010-03-22	0	NA	NA	NA	14690	As_published
1987-03-13	1987-03-13	0	NA	NA	NA	6280	As_published
1987-03-13	1987-03-13	0	NA	NA	NA	6280	As_published
2002-02-15	2002-04-15	59	NA	NA	NA	11754	As_published CHIRPS
1990-07-15	1990-09-15	62	NA	NA	NA	7541	As_published CHIRPS
1991-02-15	1991-05-15	89	NA	NA	NA	7744	As_published CHIRPS

Field Descriptions:

Plant.Start: The reported start date for planting. This indicates when the planting period began according to the original data.
Plant.End: The reported end date for planting. This marks the conclusion of the planting period in the original dataset.
Plant.Diff.Raw: The difference (in days) between Plant.Start and Plant.End—indicating how uncertain the reported planting window was.
Data.PS.Date: The estimated start date for planting, inferred from nearby or similar observations in ERA when a reported planting date is missing or uncertain.
Data.PE.Date: The estimated end date for planting, derived using the same method as Data.PS.Date to define a plausible planting window.
SOS: The estimated Start of Season date, derived from daily climate data using agroclimatic thresholds (e.g. rainfall ≥25 mm in a dekad and ≥20 mm in the following two dekads, with aridity index AI ≥ 0.5). It marks when planting conditions were first met based on climatic signals.
P.Date.Merge: The final, merged planting date calculated by the pipeline. It represents a consolidated planting date that may incorporate adjustments or estimations (for example, averaging the planting window or refining it using rainfall data). It is stored as an integer and should be interpreted as the number of days since 1970-01-01, R's Date origin (for example, 14690 corresponds to 2010-03-22).
P.Date.Merge.Source: A descriptive label indicating the source or method used to derive the merged planting date. It shows whether the date was taken directly from published data (e.g., “As_published”) or estimated using spatial or rainfall data (e.g., “Nearby_SameYear&Season 1km”, “SOS + As_published CHIRPS”).
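
The example P.Date.Merge values shown above (e.g. 14690 alongside a planting date of 2010-03-22) are consistent with R's integer Date representation, which counts days from 1970-01-01. A minimal sketch of converting them back to calendar dates, using three values copied from the example table:

```r
# P.Date.Merge values copied from the example rows above
p_date_merge <- c(14690, 6280, 11754)

# Interpret as days since R's Date origin (1970-01-01)
planting_date <- as.Date(p_date_merge, origin = "1970-01-01")

print(format(planting_date))
```

The third value falls inside its reported planting window (2002-02-15 to 2002-04-15), as expected for a date refined within an uncertain window.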

Explanation of P.Date.Merge.Source values:

clim_data$site_data[,unique(P.Date.Merge.Source)]|>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

x
As_published
As_published CHIRPS
NearbySeason_10km_Product ±42d CHIRPS
Nearby_SameYear&Season 1km
NearbySeason_1km_Product ±42d CHIRPS
SiteSeason_Product ±42d CHIRPS
SOS + As_published CHIRPS
Nearby_SameYear&Season 1km CHIRPS
Nearby_SameYear&Season 10km CHIRPS
Nearby_SameYear&Season 10km
SOS + Nearby_SameYear&Season 10km CHIRPS

Values below are presented in order of preference when estimating planting date in the P.Date.Merge field. The labels are shorthand for the raw values printed above (for example, “Published” corresponds to “As_published” and “Nearby 1km” to “Nearby_SameYear&Season 1km”):

Published: The planting date was directly reported in the original study with no need for estimation.
Published CHIRPS: A published planting date was available but was refined or verified using CHIRPS rainfall data.
Nearby 1km CHIRPS: The planting date was estimated from observations within a 1 km radius, with additional refinement using CHIRPS data.
Nearby 10km CHIRPS: As for Nearby 1km CHIRPS, but using observations within a 10 km radius.
Nearby 1km: The planting date was estimated from observations within a 1 km radius, without CHIRPS refinement.
Nearby 10km: The planting date was estimated from nearby observations aggregated over a 10 km radius due to missing or uncertain reported dates, without CHIRPS refinement.
SOS + Published: The published planting date was uncertain and was adjusted using Start-of-Season (SOS) information, without incorporating CHIRPS data.
SOS + Published CHIRPS: The published planting date was too uncertain to use directly and was adjusted using SOS rainfall onset data together with CHIRPS information.

This hierarchy reflects a logical preference: Directly observed data > Nearby analogues > Climatological estimation.
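
When filtering records by how the planting date was obtained, it can help to turn the source labels into simple flags. The sketch below works from the raw label strings listed above; the flag names chirps_refined, uses_sos, and radius_km are hypothetical, and the pattern rules are assumptions about how the labels are composed.

```r
# A few raw P.Date.Merge.Source values copied from the list above
src <- c("As_published",
         "As_published CHIRPS",
         "Nearby_SameYear&Season 1km",
         "SOS + Nearby_SameYear&Season 10km CHIRPS")

flags <- data.frame(
  source         = src,
  # TRUE where the label records a CHIRPS refinement
  chirps_refined = grepl("CHIRPS", src, fixed = TRUE),
  # TRUE where the label starts with a start-of-season adjustment
  uses_sos       = grepl("^SOS", src),
  # Search radius in km when nearby observations were used, else NA
  radius_km      = ifelse(grepl("10km", src, fixed = TRUE), 10,
                   ifelse(grepl("1km", src, fixed = TRUE), 1, NA))
)

print(flags)
```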

1.6.1.4 Season length

site_data contains information about reported harvest dates and season length. Season length may use the reported dates or be estimated.

head(clim_data$site_data[,.(Harvest.Start,Harvest.End,SLen,Data.SLen,SLen.EcoCrop,SLen.Source,SeasonLength.Data,SeasonLength.EcoCrop)])|>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Harvest.Start	Harvest.End	SLen	Data.SLen	SLen.EcoCrop	SLen.Source	SeasonLength.Data	SeasonLength.EcoCrop
2010-08-09	2010-08-09	140	NA	135	As_published + Published	140	135
1987-08-13	1987-08-13	153	NA	135	As_published + Published	153	135
NA	NA	NA	NA	101	NA	NA	101
NA	NA	NA	NA	135	NA	NA	135
NA	NA	NA	NA	135	NA	NA	135
NA	NA	NA	NA	135	NA	NA	135

Field Descriptions:

Harvest.Start: The reported or estimated date when harvest began. Typically reflects the first day of the harvest window.
Harvest.End: The reported or estimated date when harvest concluded. Typically reflects the last day of the harvest window.
SLen: Season Length – calculated as the number of days between Plant.Start and Harvest.End. Represents the observed or estimated duration of the cropping cycle.
Data.SLen: Season Length derived from reported data only (i.e., Plant.Start and Harvest.End must both be available from original records). Used to indicate where the season length is based on direct evidence rather than estimates.
SLen.EcoCrop: An estimate of cropping cycle length derived from the EcoCrop database, refined using data available in ERA where possible. Used as a fallback when data-derived values are missing. SeasonLength.EcoCrop is redundant and contains the same information as SLen.EcoCrop.
SLen.Source: This field indicates how the final Season Length (SeasonLength.Data field) used in calculations was derived, based on the origin of planting and harvest date estimates. The format is: <Planting Source> + <Season Length Source>.
SeasonLength.Data: Combines the SLen and Data.SLen fields, using the Data.SLen value where SLen is NA.
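
The relationship between these fields can be sketched as follows. The first two rows reuse dates from the example table; the third row and its Data.SLen value are hypothetical, added only to show the fallback.

```r
site <- data.frame(
  Plant.Start = as.Date(c("2010-03-22", "1987-03-13", "2002-02-15")),
  Harvest.End = as.Date(c("2010-08-09", "1987-08-13", NA)),
  Data.SLen   = c(NA, NA, 120)   # hypothetical value for illustration
)

# SLen: days between planting start and harvest end (NA if either is missing)
site$SLen <- as.integer(site$Harvest.End - site$Plant.Start)

# SeasonLength.Data: use SLen, falling back to Data.SLen where SLen is NA
site$SeasonLength.Data <- ifelse(is.na(site$SLen), site$Data.SLen, site$SLen)

print(site)
```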

Explanation of SLen.Source values:

clim_data$site_data[,unique(SLen.Source)]

 [1] "As_published + Published"                        
 [2] NA                                                
 [3] "CHIRPS As_published + Published"                 
 [4] "As_published + SLen Nearby 1km"                  
 [5] "Nearby_SameYear&Season 1km + Published"          
 [6] "Nearby_SameYear&Season 1km + SLen Nearby 1km"    
 [7] "SiteSeason_Product ±42d + SLen Nearby 1km"       
 [8] "NearbySeason_10km_Product ±42d + SLen Nearby 1km"
 [9] "SOS + As_published + Nearby 1km"                 
[10] "As_published + SLen Nearby 10km"                 
[11] "Nearby_SameYear&Season 1km + SLen Nearby 10km"   
[12] "CHIRPS SOS + As_published + Pub"                 
[13] "Nearby_SameYear&Season 10km + SLen Nearby 1km"   
[14] "NearbySeason_1km_Product ±42d + SLen Nearby 1km" 
[15] "NearbySeason_1km_Product ±42d + SLen Nearby 10km"
[16] "CHIRPS SiteSeason_Product ±42d + SLen Nearby 1km"

The format of SLen.Source is <Planting Source> + <Season Length Source>, and the order of preference for the season length source is the same as for planting. The labels below are shorthand for the raw values printed above (for example, “Published + Pub” corresponds to “As_published + Published”). Observed values include:
- Published + Pub – Both planting and harvest dates are reported with low uncertainty in the publication.
- Published + Nearby 1km – Planting date reported with low uncertainty; season length estimated from nearby (within 1 km) observations.
- CHIRPS Published + Pub – Planting date reported but uncertain, and refined using CHIRPS rainfall; harvest dates reported with low uncertainty.
- Nearby 1km + Nearby 1km – Both planting date and season length derived from nearby (within 1 km) observations.
- Nearby 1km + Nearby 10km – Planting date from observations within 1 km; season length from observations within 10 km.
- SOS + Published + Nearby 1km – The published planting date was uncertain and was adjusted using SOS information, without incorporating CHIRPS data; season length estimated from nearby (within 1 km) observations.
- CHIRPS SOS + Published + Pub – The published planting date was too uncertain to use directly and was adjusted using SOS rainfall onset data together with CHIRPS information; harvest dates reported with low uncertainty.
- Published + Nearby 10km – Planting date reported with low uncertainty; season length from observations within 10 km.
- Nearby 1km + Pub – Planting date from nearby (within 1 km) observations; harvest dates reported with low uncertainty.
- Nearby 10km + Nearby 1km – Planting date from observations within 10 km; season length from observations within 1 km.
- NA – No season length estimate was available or derived.

These combinations trace the logical fallback and merging sequence for generating season length when direct data are missing or uncertain.
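
Because the planting source can itself contain a " + " (for example "SOS + As_published"), splitting SLen.Source on the first separator gives the wrong result. A minimal sketch that splits on the last " + " instead; the output column names planting_source and slen_source are hypothetical.

```r
# Raw SLen.Source values copied from the list above, plus a missing value
slen_source <- c("As_published + Published",
                 "SOS + As_published + Nearby 1km",
                 "Nearby_SameYear&Season 1km + SLen Nearby 10km",
                 NA)

parts <- data.frame(
  # Everything before the last " + " is the planting source
  planting_source = sub(" \\+ [^+]*$", "", slen_source),
  # Everything after the last " + " is the season length source
  slen_source     = sub("^.* \\+ ", "", slen_source)
)

print(parts)
```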

These records can be merged with ERA observation data using the Site.Key and M.Year fields (M.Year typically aligns with the Time field in the main ERA dataset).

1.6.2 Climate data (`PDate.SLen.Data, PDate.SLen.EcoCrop, PDate.SLen.P30`)

Each of these climate window datasets contains a set of summary tables—one per climate indicator (e.g., temperature, rainfall, GDD)—with statistics calculated over the defined seasonal window for every crop-site-season combination that passed quality filters.

PDate.SLen.Data: site_data$P.Date.Merge and site_data$SeasonLength.Data are used to determine the start and end dates within which climate statistics are calculated. If season length is not reported or cannot be inferred from ERA data for a row in site_data, then no climate statistics are generated for that record.

PDate.SLen.EcoCrop: site_data$P.Date.Merge and site_data$SLen.EcoCrop are used to determine the start and end dates within which climate statistics are calculated. Season length is inferred from the midpoint of the EcoCrop cycle length for a crop, refined where possible using reported values within the ERA dataset. This dataset therefore imputes missing season lengths and contains more records than PDate.SLen.Data; however, season length is likely to be less accurate.

PDate.SLen.P30: site_data$P.Date.Merge is used to determine the start date of the climate window, and the end date is fixed to 30 days after planting. This represents the post-planting climate, which can be a particularly sensitive period for many crops.

names(clim_data$PDate.SLen.Data)|>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

x
gdd
temperature
rainfall
eratio
logging

Each of the following names corresponds to a list of climate statistics calculated over the seasonal window defined by P.Date.Merge and SeasonLength.Data:

gdd: Growing Degree Days — cumulative heat units over the season binned into thermal stress classes, useful for crop development and heat stress exposure tracking.
temperature: Mean, minimum, and maximum temperatures over the season. Consecutive and total days above/below temperature thresholds.
rainfall: Total and average precipitation during the season. Consecutive and total days above/below precipitation thresholds.
eratio: Evaporative ratio, the ratio of actual to potential evapotranspiration from the simulated water balance; a proxy for water availability or drought stress.
logging: Waterlogging risk, based on estimated soil moisture in excess of field capacity, which indicates excess moisture conditions.

Each object is a data.table with one row per site–season–crop combination (indexed by row_index) and columns containing summary statistics for that climate indicator.

1.6.2.1 shared fields (index or key fields)

These fields are needed for merging the climate statistics back to the ERA comparisons table.

All tables share these fields:
- Site.Key: The site identifier for spatially reconnecting to the ERA comparisons table.
- M.Year: The time period identifier for temporally reconnecting to the ERA comparisons table.
- EU: The crop or animal product code.
- Product: The crop or animal product name (this corresponds to the Product.Simple field in the ERA comparisons table).
- Plant.Start: The original planting start date (as per the ERA comparisons table raw data).
- Plant.End: The original planting end date (as per the ERA comparisons table raw data).
- Harvest.Start: The original harvest start date (as per the ERA comparisons table raw data).
- Harvest.End: The original harvest end date (as per the ERA comparisons table raw data).

Additionally, these shared fields are present:
- window: Description of the window used, useful if merging tables that use different climate window calculation methods.
- row_index: Internal index to link this row back to the corresponding entry in the site_data table.
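
A minimal sketch of joining a climate table back to ERA observations on the shared key fields. Both tables here are toy stand-ins: the era data frame, its yield column, and the choice to store M.Year as character are assumptions, and the real ERA comparisons table may need additional keys (such as the planting and harvest dates) to match rows uniquely.

```r
# Toy stand-in for the ERA comparisons table (yield values are made up)
era <- data.frame(
  Site.Key       = c("-0.0023 34.5939 B300", "-0.0023 34.5939 B300",
                     "-0.3780 35.9890 B500"),
  M.Year         = c("2010", "1987", "2004"),
  Product.Simple = c("Maize", "Maize", "Wheat"),
  yield          = c(3.1, 2.4, 1.8)
)

# Toy stand-in for clim_data$PDate.SLen.Data$gdd (two rows only)
gdd <- data.frame(
  Site.Key = c("-0.0023 34.5939 B300", "-0.0023 34.5939 B300"),
  M.Year   = c("2010", "1987"),
  Product  = c("Maize", "Maize"),
  gdd_opt  = c(1659.66, 1702.52)
)

# Left join: keep every ERA row, add climate columns where a match exists
merged <- merge(era, gdd,
                by.x = c("Site.Key", "M.Year", "Product.Simple"),
                by.y = c("Site.Key", "M.Year", "Product"),
                all.x = TRUE)

print(merged)
```

Check that M.Year has the same type in both tables before merging; values such as 2002.1 can lose precision or fail to match if one side is numeric and the other is character.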

1.6.2.2 gdd

head(clim_data$PDate.SLen.Data$gdd)|>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

gdd_subopt	gdd_opt	gdd_aboveopt	row_index	M.Year	EU	Product	Plant.Start	Plant.End	Harvest.Start	Harvest.End	Site.Key	window
58.95	1659.66	0.00	1	2010	c7	Maize	2010-03-22	2010-03-22	2010-08-09	2010-08-09	-0.0023 34.5939 B300	PlantingDate-SeasonLength.Data
132.65	1702.52	0.00	2	1987	c7	Maize	1987-03-13	1987-03-13	1987-08-13	1987-08-13	-0.0023 34.5939 B300	PlantingDate-SeasonLength.Data
282.48	934.64	0.00	12	2001.2	c7	Maize	2001-10-01	2001-10-30	2002-02-01	2002-02-28	-0.0833 37.0000 B917	PlantingDate-SeasonLength.Data
234.85	715.74	0.00	13	2002.1	c7	Maize	2002-04-01	2002-04-30	NA	NA	-0.0833 37.0000 B917	PlantingDate-SeasonLength.Data
424.50	949.32	52.95	60	2004	c14	Wheat	2004-05-25	2004-05-25	2004-10-07	2004-10-07	-0.3780 35.9890 B500	PlantingDate-SeasonLength.Data
457.86	1011.53	61.63	61	2005	c14	Wheat	2005-05-14	2005-05-14	2005-10-02	2005-10-02	-0.3780 35.9890 B500	PlantingDate-SeasonLength.Data

This table contains Growing Degree Day (GDD) statistics calculated over the defined season window for each site. Here’s what each field represents:
- gdd_subopt: Cumulative GDD within the sub-optimal temperature range for crop growth (above base temperature but below optimal).
- gdd_opt: Cumulative GDD within the optimal temperature range — where the crop is expected to grow most efficiently.
- gdd_aboveopt: Cumulative GDD in the above-optimal range, where temperatures may begin to reduce growth efficiency.
- gdd_abovemax: Cumulative GDD above the maximum threshold, indicating heat stress or potentially damaging conditions.
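
To make the banding concrete, here is one simple way to split degree days into these four classes using daily mean temperature and the EcoCrop thresholds. It is a sketch under stated assumptions, not the pipeline's method: the function gdd_bands is hypothetical, degree days are measured above Tlow, and each day's total is assigned to the band its mean temperature falls in. The pipeline may use a different formulation (for example, sub-daily temperatures), so values will not reproduce the table above.

```r
# Hypothetical helper: banded growing degree days from daily mean temperature
gdd_bands <- function(tmean, Tlow, Topt.low, Topt.high, Thigh) {
  # Degree days above the base temperature (zero on days below Tlow)
  dd <- pmax(tmean - Tlow, 0)
  c(
    gdd_subopt   = sum(dd[tmean >= Tlow & tmean < Topt.low]),
    gdd_opt      = sum(dd[tmean >= Topt.low & tmean <= Topt.high]),
    gdd_aboveopt = sum(dd[tmean > Topt.high & tmean <= Thigh]),
    gdd_abovemax = sum(dd[tmean > Thigh])
  )
}

# Toy daily mean temperatures with the maize thresholds from the crops table
tmean <- c(9, 15, 20, 25, 34, 48)
out <- gdd_bands(tmean, Tlow = 10, Topt.low = 18, Topt.high = 33, Thigh = 47)
print(out)
```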

1.6.2.3 temperature

head(clim_data$PDate.SLen.Data$temperature) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

tmin_min	tmin_mean	tmin_var	tmin_sd	tmin_range	tmax_max	tmax_mean	tmax_var	tmax_sd	tmax_range	tmean_max	tmean_min	tmean_mean	tmean_var	tmean_sd	tmean_range	row_index	M.Year	EU	Product	Plant.Start	Plant.End	Harvest.Start	Harvest.End	Site.Key	window
15.50	18.476809	0.8375705	0.9151888	5.09	30.58	26.18730	2.720730	1.649463	8.60	24.69	19.63	22.00830	0.9936999	0.9968450	5.06	1	2010	c7	Maize	2010-03-22	2010-03-22	2010-08-09	2010-08-09	-0.0023 34.5939 B300	PlantingDate-SeasonLength.Data
15.39	17.872208	0.8065650	0.8980897	5.74	32.20	26.27351	6.365606	2.523015	10.89	25.41	19.00	21.75870	1.8366676	1.3552371	6.41	2	1987	c7	Maize	1987-03-13	1987-03-13	1987-08-13	1987-08-13	-0.0023 34.5939 B300	PlantingDate-SeasonLength.Data
9.10	11.624133	1.4849358	1.2185794	5.24	29.41	25.09500	3.614292	1.901129	8.99	19.75	15.91	17.69460	0.7627982	0.8733832	3.84	12	2001.2	c7	Maize	2001-10-01	2001-10-30	2002-02-01	2002-02-28	-0.0833 37.0000 B917	PlantingDate-SeasonLength.Data
9.29	11.624065	1.7777506	1.3333231	6.09	26.77	24.29724	1.467853	1.211550	6.54	18.63	15.65	17.36220	0.4043255	0.6358659	2.98	13	2002.1	c7	Maize	2002-04-01	2002-04-30	NA	NA	-0.0833 37.0000 B917	PlantingDate-SeasonLength.Data
7.19	9.939118	1.5819044	1.2577378	5.36	25.11	21.47147	2.016235	1.419942	6.76	17.57	13.45	15.46684	0.6721388	0.8198407	4.12	60	2004	c14	Wheat	2004-05-25	2004-05-25	2004-10-07	2004-10-07	-0.3780 35.9890 B500	PlantingDate-SeasonLength.Data
6.37	10.546338	1.6467624	1.2832624	6.58	25.80	21.42162	2.694102	1.641372	8.99	17.54	13.63	15.64239	0.6342666	0.7964085	3.91	61	2005	c14	Wheat	2005-05-14	2005-05-14	2005-10-02	2005-10-02	-0.3780 35.9890 B500	PlantingDate-SeasonLength.Data

This table summarizes temperature-related climate statistics. Fields fall into two main categories:

1. Heat Stress Threshold Indicators (tmax_tg_*)

These fields summarize extreme high-temperature events, using thresholds of 35°C, 37.5°C, and 40°C. The same set of metrics is calculated for each threshold:

tmax_tg_[threshold].days: Total number of days where maximum temperature (Tmax) exceeded the threshold. e.g., tmax_tg_35.days = number of days > 35°C.
tmax_tg_[threshold].days_pr: Proportion of days in the season above the threshold.
tmax_tg_[threshold].max_rseq: Maximum length of any consecutive sequence of days above the threshold.
tmax_tg_[threshold].n_seq_dX: Number of sequences of at least X days where Tmax stayed above the threshold.
- d5: ≥5 consecutive days.
- d10: ≥10 consecutive days
- d15: ≥15 consecutive days

These indicators help assess the intensity, persistence, and frequency of heat stress.
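
Threshold indicators of this kind can be computed with run-length encoding. The sketch below is an illustration rather than the pipeline code: the function heat_stats and its output names are hypothetical, and it assumes a daily Tmax series with no missing values.

```r
# Hypothetical helper: heat stress summaries for one threshold
heat_stats <- function(tmax, threshold = 35, min_len = 5) {
  hot <- tmax > threshold            # TRUE on days above the threshold
  runs <- rle(hot)                   # consecutive runs of TRUE/FALSE
  hot_runs <- runs$lengths[runs$values]
  c(
    days     = sum(hot),                                   # total hot days
    days_pr  = mean(hot),                                  # proportion of season
    max_rseq = if (length(hot_runs)) max(hot_runs) else 0, # longest hot run
    n_seq    = sum(hot_runs >= min_len)                    # runs of >= min_len days
  )
}

# Toy season: one 5-day hot spell and one 2-day hot spell
tmax <- c(33, 36, 36, 37, 36, 38, 34, 36, 36, 33)
stats <- heat_stats(tmax, threshold = 35, min_len = 5)
print(stats)
```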

2. General Temperature Statistics

These capture broader temperature behavior during the season:

Tmin-related fields:
- tmin_min: Minimum of daily minimum temperatures
- tmin_mean: Mean daily minimum temperature
- tmin_var: Variance of daily minimum temperatures
- tmin_sd: Standard deviation of daily minimum temperatures
- tmin_range: Difference between max and min daily minimum temperatures
Tmax-related fields:
- tmax_max: Maximum of daily maximum temperatures
- tmax_mean: Mean daily maximum temperature
- tmax_var: Variance of daily maximum temperatures
- tmax_sd: Standard deviation of daily maximum temperatures
- tmax_range: Difference between max and min daily maximum temperatures
Tmean (daily average temperature) fields:
- tmean_max: Maximum of daily mean temperatures
- tmean_min: Minimum of daily mean temperatures
- tmean_mean: Mean of daily mean temperatures
- tmean_var: Variance of daily mean temperatures
- tmean_sd: Standard deviation of daily mean temperatures
- tmean_range: Difference between max and min daily mean temperatures

These metrics provide a comprehensive description of temperature variability and extremes during the growing season.

1.6.2.4 rainfall

head(clim_data$PDate.SLen.Data$rainfall) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

This table summarizes rainfall-related climate statistics.

1. Total and Derived Rainfall Metrics

- rain_sum: Total rainfall (mm) accumulated over the observation window.
- eto_sum: Total reference evapotranspiration (ETo, mm) over the window, calculated from NASA POWER data.
- eto_na: Number of days with missing ETo values due to data unavailability.
- w_balance: Approximate seasonal water balance: rain_sum – eto_sum.
- w_balance_negdays: Number of days when daily rainfall < daily evapotranspiration (i.e., water deficit days).

2. Dry Spell Indicators (rain_l_*)

These indicators summarize dry spells using thresholds of 0.1 mm, 1 mm, and 5 mm of daily rainfall.

For each threshold:
- rain_l_[threshold].days: Total number of days below the rainfall threshold, e.g., rain_l_1.days = number of days with rainfall < 1 mm.
- rain_l_[threshold].days_pr: Proportion of total days below the threshold.
- rain_l_[threshold].max_seq: Length of the longest consecutive sequence of dry days.
- rain_l_[threshold].n_seq_dX: Number of dry spells lasting at least X days:
- d5 = ≥5 consecutive days
- d10 = ≥10 consecutive days
- d15 = ≥15 consecutive days

Thresholds used:
- rain_l_0.1: Very light rainfall (effectively dry)
- rain_l_1: Light rainfall
- rain_l_5: Moderate rainfall threshold

These variables help identify drought risk, intra-seasonal dry periods, and rainfall distribution relevant to crop growth.
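
As a rough illustration of how the rainfall fields relate to daily inputs, the sketch below derives the totals, the water balance, and the 1 mm dry-spell fields from a toy series. The daily values are invented, and the code is a simplified stand-in rather than the pipeline's own implementation.

```r
# Toy daily series in mm (invented values)
rain <- c(0, 0, 12, 0.5, 0, 0, 0, 0, 0, 8, 3, 0)
eto  <- c(4, 4, 3, 4, 5, 5, 5, 4, 4, 3, 4, 4)

# Totals and seasonal water balance
rain_sum  <- sum(rain)
eto_sum   <- sum(eto)
w_balance <- rain_sum - eto_sum
w_balance_negdays <- sum(rain < eto)   # days with rainfall below ETo

# Dry-spell fields for the 1 mm threshold
dry <- rain < 1
runs <- rle(dry)
dry_runs <- runs$lengths[runs$values]  # lengths of consecutive dry-day runs
rain_l_1 <- c(days = sum(dry),
              days_pr = mean(dry),
              max_seq = max(dry_runs),
              n_seq_d5 = sum(dry_runs >= 5))
rain_l_1
```

Here 9 of the 12 days fall below 1 mm, the longest dry run lasts 6 days, and the seasonal balance is negative because ETo exceeds rainfall.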

1.6.2.5 eratio

head(clim_data$PDate.SLen.Data$eratio) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

eratio_mean	eratio_median	eratio_min	eratio_l_0.5.days	eratio_l_0.5.days_pr	eratio_l_0.5.max_seq	eratio_l_0.5.n_seq_d5	eratio_l_0.5.n_seq_d10	eratio_l_0.5.n_seq_d15	eratio_l_0.25.days	eratio_l_0.25.days_pr	eratio_l_0.25.max_seq	eratio_l_0.25.n_seq_d5	eratio_l_0.25.n_seq_d10	eratio_l_0.25.n_seq_d15	eratio_l_0.1.days	eratio_l_0.1.days_pr	eratio_l_0.1.max_seq	eratio_l_0.1.n_seq_d5	eratio_l_0.1.n_seq_d10	eratio_l_0.1.n_seq_d15	row_index	M.Year	EU	Product	Plant.Start	Plant.End	Harvest.Start	Harvest.End	Site.Key	window
0.8575887	1.000	0.23	25	0.18	15	2	1	0	1	0.01	1	0	0	0	0	0.00	0	0	0	0	1	2010	c7	Maize	2010-03-22	2010-03-22	2010-08-09	2010-08-09	-0.0023 34.5939 B300	PlantingDate-SeasonLength.Data
0.7500000	0.900	0.14	36	0.23	14	2	2	0	9	0.06	7	1	0	0	0	0.00	0	0	0	0	2	1987	c7	Maize	1987-03-13	1987-03-13	1987-08-13	1987-08-13	-0.0023 34.5939 B300	PlantingDate-SeasonLength.Data
0.2732000	0.100	0.01	116	0.77	41	3	3	3	94	0.63	27	6	4	2	73	0.49	23	5	2	1	12	2001.2	c7	Maize	2001-10-01	2001-10-30	2002-02-01	2002-02-28	-0.0833 37.0000 B917	PlantingDate-SeasonLength.Data
0.2997561	0.150	0.01	92	0.75	34	4	4	3	70	0.57	31	4	3	1	56	0.46	29	4	1	1	13	2002.1	c7	Maize	2002-04-01	2002-04-30	NA	NA	-0.0833 37.0000 B917	PlantingDate-SeasonLength.Data
0.2817647	0.205	0.01	112	0.82	40	5	5	2	75	0.55	17	6	3	1	38	0.28	11	3	1	0	60	2004	c14	Wheat	2004-05-25	2004-05-25	2004-10-07	2004-10-07	-0.3780 35.9890 B500	PlantingDate-SeasonLength.Data
0.5081690	0.445	0.02	77	0.54	16	6	3	2	43	0.30	14	3	2	0	21	0.15	10	2	0	0	61	2005	c14	Wheat	2005-05-14	2005-05-14	2005-10-02	2005-10-02	-0.3780 35.9890 B500	PlantingDate-SeasonLength.Data

These variables describe evaporative ratio (Eratio) statistics, which serve as a proxy for water stress during the crop season.
Eratio is computed as the ratio of actual evapotranspiration (Ea) to potential evapotranspiration (Ep), based on a daily water balance simulation that accounts for rainfall, PET, and soil water-holding capacity:

Eratio = Ea / Ep

Ep (potential evapotranspiration) is calculated using the Priestley–Taylor method.
Ea is estimated by simulating daily water availability in the soil, using a simple empirical model based on soil capacity and depletion (see calc_daily_watbal() in watbal_all_in_one.R).
Soil properties (e.g., field capacity, saturation, depth) are estimated using a pedotransfer function (AWCPTF()), and aggregated with soilcap_calc().

This approach integrates soil, rainfall, and climate to better reflect actual water supply to crops, beyond rainfall alone.

Low values indicate water deficits, while higher values suggest sufficient water supply relative to atmospheric demand.
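
To make the Ea / Ep idea concrete, the sketch below runs a deliberately simple single-bucket water balance. It is not calc_daily_watbal(): the function name simple_eratio(), the supply-limited rule for Ea, and all input values are assumptions chosen for illustration, whereas the pipeline uses an empirical depletion model.

```r
# Hypothetical single-bucket model (not the pipeline's calc_daily_watbal()).
# rain, ep: daily rainfall and potential evapotranspiration (mm), ep > 0
# soilcp:   assumed soil water-holding capacity (mm)
# avail0:   assumed initial plant-available water (mm)
simple_eratio <- function(rain, ep, soilcp, avail0 = 0) {
  n <- length(rain)
  ea <- numeric(n)
  avail <- avail0
  for (i in seq_len(n)) {
    supply <- avail + rain[i]             # water available to meet today's demand
    ea[i] <- min(ep[i], supply)           # Ea is limited by both demand and supply
    avail <- min(soilcp, supply - ea[i])  # carry-over storage, capped at capacity
  }
  ea / ep
}

# One 10 mm rain event followed by four dry days, constant demand of 4 mm/day
rain <- c(10, 0, 0, 0, 0)
ep <- rep(4, 5)
eratio <- simple_eratio(rain, ep, soilcp = 50)
eratio
```

In this toy run demand is fully met for two days, half met on the third, and unmet afterwards, which shows how stored soil water lets Eratio stay high on rain-free days.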

1. Summary Eratio Statistics

eratio_mean: Mean daily Eratio over the observation window.
eratio_median: Median daily Eratio.
eratio_min: Minimum daily Eratio (most severe water deficit day).

2. Water Stress Indicators (eratio_l_*)

These fields capture frequency, duration, and intensity of low Eratio events, using thresholds of <0.5, <0.25, and <0.1.

For each threshold:
- eratio_l_[threshold].days: Number of days where Eratio fell below the threshold, e.g., eratio_l_0.5.days = number of days with Eratio < 0.5.
- eratio_l_[threshold].days_pr: Proportion of total days with Eratio below the threshold.
- eratio_l_[threshold].max_seq: Maximum consecutive sequence of days below the threshold.
- eratio_l_[threshold].n_seq_dX: Number of spells of at least X consecutive days below the threshold:
- d5 = ≥5 consecutive days
- d10 = ≥10 consecutive days
- d15 = ≥15 consecutive days

Thresholds represent escalating levels of water stress:
- 0.5: Mild deficit
- 0.25: Moderate deficit
- 0.1: Severe deficit

These metrics can be used to identify seasonal water stress risk, evaluate drought periods, and inform adaptive irrigation or planting strategies.

1.6.2.6 logging

head(clim_data$PDate.SLen.Data$logging) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

logging_sum	logging_mean	logging_present_mean	logging_g_0.days	logging_g_0.days_pr	logging_g_ssat_0.5.days	logging_g_ssat_0.5.days_pr	logging_g_ssat_0.5.max_seq	logging_g_ssat_0.5.n_seq_d5	logging_g_ssat_0.5.n_seq_d10	logging_g_ssat_0.5.n_seq_d15	logging_g_ssat_0.9.days	logging_g_ssat_0.9.days_pr	logging_g_ssat_0.9.max_seq	logging_g_ssat_0.9.n_seq_d5	logging_g_ssat_0.9.n_seq_d10	logging_g_ssat_0.9.n_seq_d15	row_index	M.Year	EU	Product	Plant.Start	Plant.End	Harvest.Start	Harvest.End	Site.Key	window
7.80	0.0553191	0.60	13	0.09	13	0.09	62	4	3	2	13	0.09	62	4	3	2	1	2010	c7	Maize	2010-03-22	2010-03-22	2010-08-09	2010-08-09	-0.0023 34.5939 B300	PlantingDate-SeasonLength.Data
1.80	0.0116883	0.60	3	0.02	3	0.02	86	2	2	2	3	0.02	86	2	2	2	2	1987	c7	Maize	1987-03-13	1987-03-13	1987-08-13	1987-08-13	-0.0023 34.5939 B300	PlantingDate-SeasonLength.Data
3.15	0.0210000	0.45	7	0.05	7	0.05	102	4	2	2	7	0.05	102	4	2	2	12	2001.2	c7	Maize	2001-10-01	2001-10-30	2002-02-01	2002-02-28	-0.0833 37.0000 B917	PlantingDate-SeasonLength.Data
2.70	0.0219512	0.45	6	0.05	6	0.05	106	1	1	1	6	0.05	106	1	1	1	13	2002.1	c7	Maize	2002-04-01	2002-04-30	NA	NA	-0.0833 37.0000 B917	PlantingDate-SeasonLength.Data
0.00	0.0000000	0.00	0	0.00	0	0.00	0	0	0	0	0	0.00	0	0	0	0	60	2004	c14	Wheat	2004-05-25	2004-05-25	2004-10-07	2004-10-07	-0.3780 35.9890 B500	PlantingDate-SeasonLength.Data
0.00	0.0000000	0.00	0	0.00	0	0.00	0	0	0	0	0	0.00	0	0	0	0	61	2005	c14	Wheat	2005-05-14	2005-05-14	2005-10-02	2005-10-02	-0.3780 35.9890 B500	PlantingDate-SeasonLength.Data

These variables summarize soil waterlogging conditions during the crop season.
Waterlogging is defined here as the amount of water held in the soil above field capacity but below saturation, simulated via a daily water balance using calc_daily_watbal() from watbal_all_in_one.R.

Logging occurs when incoming rainfall exceeds the soil’s capacity to retain water at field capacity, but has not yet exceeded total saturation.

1. Summary Waterlogging Statistics

logging_sum: Total cumulative logging value across the observation window.
logging_mean: Mean daily logging value.
logging_median: Median daily logging value.
logging_present_mean: Mean logging value on days when waterlogging was present (i.e., > 0).

2. General Waterlogging Presence (logging_g_0.*)

These fields indicate periods when water balance > 0, a proxy for general waterlogging.

logging_g_0.days: Number of days where waterlogging > 0.
logging_g_0.days_pr: Proportion of days with waterlogging > 0.
logging_g_0.max_seq: Longest consecutive sequence of waterlogged days.
logging_g_0.n_seq_dX: Number of spells of at least X consecutive days with waterlogging:
- d5: ≥5 consecutive days
- d10: ≥10 consecutive days
- d15: ≥15 consecutive days

3. Saturation Threshold Indicators (logging_g_ssat_*)

These fields apply stricter thresholds based on soil saturation:
- ssat_0.5: Moderate saturation (50% of saturation)
- ssat_0.9: High saturation (90% of saturation)

For each threshold:

logging_g_ssat_[threshold].days: Number of days exceeding the saturation threshold.
logging_g_ssat_[threshold].days_pr: Proportion of season with saturation exceeded.
logging_g_ssat_[threshold].max_seq: Maximum consecutive days above threshold.
logging_g_ssat_[threshold].n_seq_dX: Number of long saturation spells:
- d5: ≥5 consecutive days
- d10: ≥10 consecutive days
- d15: ≥15 consecutive days

These indicators help assess excess moisture risks, which can influence root health, germination success, and yields.
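
The sketch below shows one way the waterlogging summaries could be derived from a daily series. It rests on two assumptions that the text does not confirm: that ssat is the amount of water (mm) the soil can hold between field capacity and saturation, and that the ssat_0.5 and ssat_0.9 thresholds are fractions of that amount. All input values are invented.

```r
# Assumed inputs (hypothetical values, mm)
ssat <- 20                                # assumed capacity between field capacity and saturation
excess <- c(0, 3, 12, 19, 25, 11, 0, 0)   # water above field capacity before capping

# Logging is the excess above field capacity, capped at saturation
logging <- pmin(pmax(excess, 0), ssat)

logging_stats <- c(
  "logging_sum" = sum(logging),
  "logging_mean" = mean(logging),
  "logging_present_mean" = mean(logging[logging > 0]),
  "logging_g_0.days" = sum(logging > 0),
  "logging_g_ssat_0.5.days" = sum(logging > 0.5 * ssat),
  "logging_g_ssat_0.9.days" = sum(logging > 0.9 * ssat)
)
logging_stats
```

With these toy values, 5 days show some waterlogging, 4 exceed half of the assumed saturation capacity, and 2 exceed 90% of it.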

1.7 Connecting climate stats back to the ERA database

1.7.1 ERA Comparisons Table

# Set the remote S3 path and local save path
s3_data_dir <- "s3://digital-atlas/era/data"
local_data_dir <- "downloaded_data"

# List and filter files
s3<-s3fs::S3FileSystem$new(anonymous = T)
files_s3 <- s3$dir_ls(s3_data_dir)
files_s3 <- grep("compiled.*mh.*parquet", files_s3, value = TRUE)

# Filter to most recent version of dataset
(files_s3 <- tail(files_s3, 1))

[1] "s3://digital-atlas/era/data/era_compiled_ls-v1.0-mh_2025-03-19.2-sc_2025_01_30.1-ie_2025_05_09.2-2025-05-09.1.parquet"

# Create local file path and download
files_local <- gsub(s3_data_dir, local_data_dir, files_s3)
if(!file.exists(files_local)){
  s3$file_download(files_s3, files_local)
}

# Load the data
era_comparisons<-arrow::read_parquet(files_local)

key_fields<-c("Site.Key","M.Year","Product.Simple","Plant.Start","Plant.End","Harvest.Start","Harvest.End")

# Climate data to merge (copy so the edits below do not modify clim_data by reference)
clim_mergedat<-copy(clim_data$PDate.SLen.EcoCrop$gdd)
# Rename the Product field to match the ERA comparisons table
setnames(clim_mergedat,"Product","Product.Simple")
# Remove unneeded columns
clim_mergedat[,c("row_index","window","EU"):=NULL]
# Remove any duplicates
clim_mergedat<-unique(clim_mergedat)

# Merge datasets
era_comparisons_gdd<-merge(era_comparisons,clim_mergedat,by=key_fields,all.x=T,sort=F)

# Explore merge result
head(era_comparisons_gdd[!is.na(gdd_subopt),c(key_fields,grep("gdd",colnames(era_comparisons_gdd),value=T)),with=F])|>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Site.Key	M.Year	Product.Simple	Plant.Start	Plant.End	Harvest.Start	Harvest.End	gdd_subopt	gdd_opt	gdd_aboveopt	gdd_abovemax

How many observations have we enriched?

era_comparisons_gdd[!is.na(gdd_subopt),.N]

[1] 0

era_comparisons_gdd[!is.na(gdd_subopt),.N]/era_comparisons_gdd[,.N]

[1] 0

Why is this less than half of the total data available?
1. A planting date or window must have been reported.
2. If planting uncertainty is too high, it may not have been possible to infer the planting date.
3. Sites with large spatial uncertainty (>50 km radius) are excluded.
4. Climate statistics have not been calculated for animal experiments.
5. Climate statistics are not calculated for spatially aggregated sites, products, or time periods.

era_comparisons_gdd[!is.na(gdd_subopt),.N]/era_comparisons_gdd[!is.na(Plant.Start) & 
                                                                 Buffer<50000 &
                                                                 !grepl("[.][.]",Site.ID) & 
                                                                 !grepl("[.][.]",M.Year) &
                                                                 !grepl("-",Product.Simple),.N]

[1] 0

Non-matches, if present, indicate missing data in the ERA climate statistics pipeline. If you find any, please let us know and check for updates.

era_comparisons_gdd[!is.na(Plant.Start) & 
                         Buffer<50000 &
                         !grepl("[.][.]",Site.ID) & 
                         !grepl("[.][.]",M.Year) &
                         !grepl("-",Product.Simple) & is.na(gdd_subopt),key_fields,with=F]

                Site.Key M.Year Product.Simple Plant.Start  Plant.End
                  <char> <char>         <char>      <Date>     <Date>
 1: 05.4810 07.5370 B300   2001     Crustacean  1985-08-21 1985-08-31
 2: 05.4810 07.5370 B300   2002     Crustacean  1986-08-21 1986-08-31
 3: 05.4810 07.5370 B300   2001     Crustacean  1986-08-21 1986-08-31
 4: 05.4810 07.5370 B300   2001     Crustacean  1987-08-21 1987-08-31
 5: 05.4810 07.5370 B300   2002     Crustacean  1986-08-21 1986-08-31
 6: 05.4810 07.5370 B300   2002     Crustacean  1987-08-21 1987-08-31
 7: 05.4810 07.5370 B300   2001     Crustacean  1987-08-21 1987-08-31
 8: 05.4810 07.5370 B300   2001     Crustacean  1987-08-21 1987-08-31
 9: 05.4810 07.5370 B300   2002     Crustacean  1986-08-21 1986-08-31
10: 05.4810 07.5370 B300   2002     Crustacean  1987-08-21 1987-08-31
    Harvest.Start Harvest.End
           <Date>      <Date>
 1:    1985-10-21  1985-10-30
 2:    1986-10-21  1986-10-30
 3:    1986-10-21  1986-10-30
 4:    1987-10-21  1987-10-30
 5:    1986-10-21  1986-10-30
 6:    1987-10-21  1987-10-30
 7:    1987-10-21  1987-10-30
 8:    1987-10-21  1987-10-30
 9:    1986-10-21  1986-10-30
10:    1987-10-21  1987-10-30
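
When a merge returns fewer matches than expected, one thing worth checking is whether the key fields have the same class and formatting in both tables. The sketch below uses toy tables to show such a check; the objects era_toy and clim_toy, their values, and the reduced key set are all invented for illustration and are not a diagnosis of the results above.

```r
library(data.table)

# Toy stand-ins for the ERA table and a climate table (invented values)
era_toy <- data.table(
  Site.Key = c("A", "A", "B"),
  M.Year = c("2010", "2011", "2010"),       # character
  Product.Simple = c("Maize", "Maize", "Wheat"),
  Yield = c(2.1, 2.4, 3.0)
)
clim_toy <- data.table(
  Site.Key = c("A", "B"),
  M.Year = c(2010, 2010),                   # numeric: differs from era_toy
  Product.Simple = c("Maize", "Wheat"),
  gdd_opt = c(1500, 1200)
)
keys <- c("Site.Key", "M.Year", "Product.Simple")

# Compare the class of each key field across the two tables
key_classes <- data.table(
  field = keys,
  era = sapply(keys, function(k) class(era_toy[[k]])[1]),
  clim = sapply(keys, function(k) class(clim_toy[[k]])[1])
)
mismatched <- key_classes[era != clim, field]

# Align the mismatched field, then merge and count enriched rows
clim_toy[, M.Year := as.character(M.Year)]
merged <- merge(era_toy, clim_toy, by = keys, all.x = TRUE, sort = FALSE)
n_matched <- merged[!is.na(gdd_opt), .N]
```

In this toy case only M.Year differs in class, and after aligning it two of the three rows receive climate values.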

1.8 Foundational datasets

1.8.1 Rainfall

Rainfall data are downloaded using R/add_geodata/functions/download_chirps.R and processed using the script R/add_geodata/chirps.R.

1.8.1.1 Access

The annual and long-term average datasets are small, so we can simply download them from the ERA S3 bucket.

# Set the remote S3 path and local save path
s3_data_dir <- "s3://digital-atlas/era/geodata"

# List and filter files
s3<-s3fs::S3FileSystem$new(anonymous = T)
files_s3 <- s3$dir_ls(s3_data_dir)

file_ltavg<-grep("chirps_ltavg.*parquet", files_s3, value = TRUE)
file_annual<-grep("chirps_annual.*parquet", files_s3, value = TRUE)

# Filter to most recent version of dataset
(file_ltavg <- tail(file_ltavg, 1))

[1] "s3://digital-atlas/era/geodata/chirps_ltavg_2025-04-12.parquet"

(file_annual <- tail(file_annual, 1))

[1] "s3://digital-atlas/era/geodata/chirps_annual_2025-04-12.parquet"

files_s3<-c(file_ltavg,file_annual)

# Create local file path and download
files_local <- gsub(s3_data_dir, local_data_dir, files_s3)

for(i in seq_along(files_local)){
  if(!file.exists(files_local[i])){
    s3$file_download(files_s3[i], files_local[i])
  }
}

# Load ltavg and annual data
chirps_ltavg<-arrow::read_parquet(files_local[1])
chirps_annual<-arrow::read_parquet(files_local[2])

The daily CHIRPS dataset is quite large, so we use the arrow package to download only the head of the data. To learn more about using the arrow package to access parquet data in R, see https://arrow.apache.org/docs/r/.
In future ERA updates we will optimize the partition structure of the parquet tables to facilitate faster access. In the short term, we suggest that working locally with downloaded files is still the best option.
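
Once a parquet file is available locally, arrow can query it lazily so that only the rows and columns you need are read into memory. The sketch below demonstrates the pattern on a small temporary file; the toy table and site names are invented, and it assumes the arrow and dplyr packages are installed.

```r
library(arrow)
library(dplyr)

# Toy daily rainfall table written to a temporary parquet file (invented values)
toy <- data.frame(
  Site.Key = rep(c("site_a", "site_b"), each = 3),
  date = rep(as.Date("1981-01-01") + 0:2, times = 2),
  Rain = c(0, 5.2, 0, 1.1, 0, 7.4)
)
tmp <- tempfile(fileext = ".parquet")
write_parquet(toy, tmp)

# Open lazily, filter to one site, and only then pull the result into memory
site_rain <- open_dataset(tmp) |>
  filter(Site.Key == "site_a") |>
  select(date, Rain) |>
  collect()
site_rain
```

The same filter-then-collect pattern should apply to a locally downloaded copy of the daily file, where filtering on Site.Key before collect() avoids loading every site.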

# Load head of daily data only
files_s3 <- s3$dir_ls(s3_data_dir)

file_daily<-grep("chirps.*parquet", files_s3, value = TRUE)
file_daily<-file_daily[!grepl("ltavg|annual",file_daily)]
(file_daily <- tail(file_daily, 1))

[1] "s3://digital-atlas/era/geodata/chirps_2025-04-12.parquet"

files_local <- gsub(s3_data_dir, local_data_dir, file_daily)

if(!file.exists(files_local)){

  chirps_daily<-open_dataset(file_daily, format = "parquet", filesystem = s3)

  # Read the first 5 rows into a data.table
  chirps_daily <- as.data.table(head(chirps_daily, 5))
  
  # Save result
  arrow::write_parquet(chirps_daily,files_local)
}else{
  chirps_daily<-arrow::read_parquet(files_local)
}

1.8.1.2 Structure

Daily precipitation

head(chirps_daily) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

date	Rain	Site.Key	day_count
1981-01-01	0.00	-0.0023 34.5939 B300	29585
1981-01-02	0.00	-0.0023 34.5939 B300	29586
1981-01-03	0.00	-0.0023 34.5939 B300	29587
1981-01-04	0.00	-0.0023 34.5939 B300	29588
1981-01-05	22.81	-0.0023 34.5939 B300	29589
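
The day_count values in the rows shown are consistent with the number of days elapsed since 1900-01-01. That origin is inferred from the sample rows rather than stated in the documentation, so treat the conversion below as an assumption to verify against your own data.

```r
# Assumed origin, inferred from the sample rows (not documented)
origin <- as.Date("1900-01-01")

# Date -> day count: reproduces the value shown for 1981-01-01
first_count <- as.integer(as.Date("1981-01-01") - origin)
first_count

# Day count -> date: recovers the date of the fifth row
fifth_date <- as.Date(29589, origin = origin)
fifth_date
```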

Annual precipitation

head(chirps_annual) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Site.Key	Year	Total.Rain
-0.0023 34.5939 B300	1981	1585.75
-0.0108 36.9617 B250	1981	836.60
-0.0333 34.8000 B917	1981	1577.04
-0.0333 37.8333 B917	1981	1424.46
-0.0420 34.5920 B12500	1981	1467.24
-0.0620 34.2290 B30000	1981	1232.72

Long-term average precipitation

head(chirps_ltavg) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Site.Key	Total.Rain.mean	Total.Rain.sd	Total.Rain
-0.0023 34.5939 B300	1794.8684	281.26	1794.87
-0.0108 36.9617 B250	758.7884	153.56	758.79
-0.0333 34.8000 B917	1707.8786	273.12	1707.88
-0.0333 37.8333 B917	1249.3691	311.05	1249.37
-0.0420 34.5920 B12500	1677.6044	261.69	1677.60
-0.0620 34.2290 B30000	1377.4658	218.68	1377.47
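
Annual totals and long-term averages can be combined to express a given year as a departure from normal. The sketch below computes a per-site mean and standard deviation from toy annual totals and then a standardized anomaly; the values and the Rain.anomaly column name are invented, and the published long-term table may be computed over a different period.

```r
library(data.table)

# Toy annual totals in mm (invented values)
annual <- data.table(
  Site.Key = rep(c("site_a", "site_b"), each = 4),
  Year = rep(2001:2004, times = 2),
  Total.Rain = c(900, 1100, 1000, 1000, 400, 600, 500, 500)
)

# Long-term mean and standard deviation per site
ltavg <- annual[, .(Total.Rain.mean = mean(Total.Rain),
                    Total.Rain.sd = sd(Total.Rain)), by = Site.Key]

# Standardized anomaly of each year relative to its site's long-term mean
annual <- merge(annual, ltavg, by = "Site.Key")
annual[, Rain.anomaly := (Total.Rain - Total.Rain.mean) / Total.Rain.sd]
annual[]
```

A negative anomaly marks a year drier than the site's long-term mean, and a positive one a wetter year.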

1.8.2 POWER

POWER data are downloaded using R/add_geodata/functions/download_power.R and processed using the script R/add_geodata/power.R.

1.8.2.1 Access

The annual and long-term average datasets are small, so we can simply download them from the ERA S3 bucket.

# Set the remote S3 path and local save path
s3_data_dir <- "s3://digital-atlas/era/geodata"

# List and filter files
s3<-s3fs::S3FileSystem$new(anonymous = T)
files_s3 <- s3$dir_ls(s3_data_dir)

file_ltavg<-grep("POWER_ltavg.*parquet", files_s3, value = TRUE)
file_annual<-grep("POWER_annual.*parquet", files_s3, value = TRUE)

# Filter to most recent version of dataset
(file_ltavg <- tail(file_ltavg, 1))

[1] "s3://digital-atlas/era/geodata/POWER_ltavg_2025-04-12.parquet"

(file_annual <- tail(file_annual, 1))

[1] "s3://digital-atlas/era/geodata/POWER_annual_2025-04-12.parquet"

files_s3<-c(file_ltavg,file_annual)

# Create local file path and download
files_local <- gsub(s3_data_dir, local_data_dir, files_s3)

for(i in seq_along(files_local)){
  if(!file.exists(files_local[i])){
    s3$file_download(files_s3[i], files_local[i])
  }
}

# Load ltavg and annual data
power_ltavg<-arrow::read_parquet(files_local[1])
power_annual<-arrow::read_parquet(files_local[2])

files_s3 <- s3$dir_ls(s3_data_dir)

file_daily<-grep("POWER.*parquet", files_s3, value = TRUE)
file_daily<-file_daily[!grepl("ltavg|annual",file_daily)]
(file_daily <- tail(file_daily, 1))

[1] "s3://digital-atlas/era/geodata/POWER_2025-04-12.parquet"

files_local <- gsub(s3_data_dir, local_data_dir, file_daily)

if(!file.exists(files_local)){

  s3$file_download(file_daily, files_local)

  # Subset to the first 5 rows
  power_daily<-head(arrow::read_parquet(files_local),5)

  # Save result (overwrites the full download with the 5-row subset)
  arrow::write_parquet(power_daily,files_local)
}else{
  power_daily<-arrow::read_parquet(files_local)
}

1.8.2.2 Structure

Daily power

head(power_daily) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Site.Key	Year	Day	Pressure.Corrected	WindSpeed	Specific.Humid	Temp.Min	Humid	Temp.Max	Temp.Mean	Pressure	SRad	Rain	Latitude	Longitude	Altitude	ETo	Date	DayCount
-0.0023 34.5939 B300	1984	1	84.72	1.52	12.16	16.06	64.75	28.67	22.50	87.10	22.02	0.01	-0.0023	-0.0023	1515.043	4.54	1984-01-01	30680
-0.0023 34.5939 B300	1984	2	84.76	1.67	12.34	16.98	63.54	28.30	22.82	87.14	21.16	0.00	-0.0023	-0.0023	1515.043	4.51	1984-01-02	30681
-0.0023 34.5939 B300	1984	3	84.80	2.42	12.43	17.08	67.59	28.18	22.01	87.17	21.43	0.00	-0.0023	-0.0023	1515.043	4.66	1984-01-03	30682
-0.0023 34.5939 B300	1984	4	84.78	1.63	12.18	15.99	66.83	28.29	21.84	87.15	22.30	0.00	-0.0023	-0.0023	1515.043	4.54	1984-01-04	30683
-0.0023 34.5939 B300	1984	5	84.62	2.08	10.42	15.68	57.07	28.73	21.98	87.00	24.13	0.00	-0.0023	-0.0023	1515.043	5.20	1984-01-05	30684

Annual power

head(power_annual) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Site.Key	Year	Total.Rain	Total.ETo	S.Humid.Mean	Humid.Mean	Temp.Mean.Mean	Temp.Max.Mean	Temp.Max	Temp.Min.Mean	Temp.Min
-0.0023 34.5939 B300	1984	1662	1692	12.70	67.46	22.41	27.96	35.14	17.57	14.23
-0.0023 34.5939 B300	1985	1762	1613	12.91	70.53	21.90	27.01	34.60	17.54	14.72
-0.0023 34.5939 B300	1986	1630	1642	12.80	68.31	22.36	27.76	33.83	17.70	14.95
-0.0023 34.5939 B300	1987	1912	1600	13.47	71.13	22.38	27.45	32.20	17.88	14.32
-0.0023 34.5939 B300	1988	1999	1636	13.30	69.76	22.56	27.69	35.42	18.10	15.20
-0.0023 34.5939 B300	1989	1789	1582	12.92	69.70	22.06	27.17	32.86	17.50	14.73

Long-term average power

head(power_ltavg) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Site.Key	Total.Rain.Mean	Total.ETo.Mean	S.Humid.Mean	Humid.Mean	Temp.Mean.Mean	Temp.Max.Mean	Temp.Max	Temp.Min.Mean	Temp.Min	Total.Rain.sd	Total.ETo.sd	S.Humid.Mean.sd	Humid.Mean.sd	Temp.Mean.Mean.sd	Temp.Max.Mean.sd	Temp.Max.sd	Temp.Min.Mean.sd	Temp.Min.sd
-0.0023 34.5939 B300	1814.49	1625.00	13.34	70.12	22.53	27.60	34.23	18.10	14.59	345.70	69.05	0.44	2.71	0.38	0.59	1.49	0.31	0.81
-0.0108 36.9617 B250	1218.59	1421.90	10.62	70.84	17.74	24.85	29.40	11.87	8.17	298.93	86.31	0.51	3.58	0.52	0.80	1.35	0.50	0.77
-0.0333 34.8000 B917	1864.63	1681.07	13.38	67.85	23.05	27.37	33.28	19.13	15.10	358.39	70.68	0.42	2.53	0.35	0.52	1.33	0.28	0.95
-0.0333 37.8333 B917	1184.73	1345.34	10.95	74.56	17.30	23.57	27.92	12.59	9.16	297.46	85.20	0.46	3.40	0.48	0.73	1.13	0.42	0.84
-0.0420 34.5920 B12500	1814.49	1619.80	13.34	70.12	22.53	27.60	34.23	18.10	14.59	345.70	69.41	0.44	2.71	0.38	0.59	1.49	0.31	0.81
-0.0620 34.2290 B30000	1873.51	1631.00	13.68	69.94	22.98	26.99	32.85	19.32	15.93	344.15	65.83	0.38	2.25	0.35	0.51	1.33	0.31	0.78

1.8.3 Soilgrids

1.8.3.1 Access

# Set the remote S3 path and local save path
s3_data_dir <- "s3://digital-atlas/era/geodata"

# List and filter files
s3<-s3fs::S3FileSystem$new(anonymous = T)
files_s3 <- s3$dir_ls(s3_data_dir)

files_s3<-files_s3[!grepl("watbal",files_s3)]

file_soilgrids<-grep("soilgrids2.0.*parquet", files_s3, value = TRUE)
file_isda<-grep("isda.*parquet", files_s3, value = TRUE)

# Filter to most recent version of dataset
(file_soilgrids <- tail(file_soilgrids, 1))

[1] "s3://digital-atlas/era/geodata/soilgrids2.0_2025-04-11.parquet"

(file_isda <- tail(file_isda, 1))

[1] "s3://digital-atlas/era/geodata/isda_2025-04-12.parquet"

files_s3<-c(file_isda,file_soilgrids)

# Create local file path and download
files_local <- gsub(s3_data_dir, local_data_dir, files_s3)

for(i in seq_along(files_local)){
  if(!file.exists(files_local[i])){
    s3$file_download(files_s3[i], files_local[i])
  }
}

# Load data
# Note the soilgrids data is quite large (>150 MB) so it will take a few minutes to download
# files_local follows the order of files_s3: isda first, soilgrids second
isda<-arrow::read_parquet(files_local[1])
soilgrids<-arrow::read_parquet(files_local[2])

1.8.3.2 Structure

Soil grids

head(soilgrids) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

variable	value	Site.Key	depth	stat
bdod	1.230000	-1.1720 -80.3920 B400	0-5cm	mean
bdod	1.210476	-1.1720 -80.3920 B400	0-5cm	mean
bdod	1.210000	-1.1720 -80.3920 B400	0-5cm	mean
bdod	1.230000	-1.1720 -80.3920 B400	0-5cm	mean
bdod	1.220114	-1.1720 -80.3920 B400	0-5cm	mean
bdod	1.200457	-1.1720 -80.3920 B400	0-5cm	mean

files_s3 <- s3$dir_ls(s3_data_dir)
(files_s3<-grep("soilgrids2.0.*metadata", files_s3, value = TRUE))

[1] "s3://digital-atlas/era/geodata/soilgrids2.0_metadata.csv"

# Replace s3 path with https path so we can read directly into R
http_path <- gsub("^s3://([^/]+)/(.*)$", "https://\\1.s3.amazonaws.com/\\2", files_s3)

soilgrids_metadata<-fread(http_path)

head(soilgrids_metadata) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Name	Description	Mapped units	Conversion factor	Conventional units	source	resolution
bdod	Bulk density of the fine earth fraction	cg/cm^3	100	kg/dm^3	SoilGrids 2.0	250m
cec	Cation Exchange Capacity of the soil	mmol(c)/kg	10	cmol(c)/kg	SoilGrids 2.0	250m
cfvo	Volumetric fraction of coarse fragments (> 2 mm)	cm^3/dm^3 (vol per mil)	10	cm^3/100cm^3 (vol%)	SoilGrids 2.0	250m
clay	Proportion of clay particles (< 0.002 mm) in the fine earth fraction	g/kg	10	g/100g (%)	SoilGrids 2.0	250m
nitrogen	Total nitrogen (N)	cg/kg	100	g/kg	SoilGrids 2.0	250m
phh2o	Soil pH	pH*10	10	pH	SoilGrids 2.0	250m

isda

head(isda) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Site.Key	stat	variable	depth	error	value
-0.0023 34.5939 B300	mean	al	0-20cm	1.1	267.75
-0.0023 34.5939 B300	mean	al	20-50cm	0.8	321.87
-0.0023 34.5939 B300	mean	bdr	0-200cm	21.1	200.00
-0.0023 34.5939 B300	mean	c.tot	0-20cm	NA	22.91
-0.0023 34.5939 B300	mean	c.tot	20-50cm	NA	17.78
-0.0023 34.5939 B300	mean	ca	0-20cm	0.2	933.95

files_s3 <- s3$dir_ls(s3_data_dir)

files_s3<-grep("isda.*metadata", files_s3, value = TRUE)

# Replace s3 path with https path so we can read directly into R
http_path <- gsub("^s3://([^/]+)/(.*)$", "https://\\1.s3.amazonaws.com/\\2", files_s3)

isda_metadata<-fread(http_path)

head(isda_metadata) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

var	description	unit
al	extractable aluminum	mg kg-1
bdr	bed rock depth	cm
clay	clay content	%
c.tot	total carbon	kg-1
ca	extractable calcium	mg kg-1
db.od	bulk density	kg m-3

1.8.4 Elevation

1.8.4.1 Access

files_s3 <- s3$dir_ls(s3_data_dir)
files_s3<-grep("elevation.*parquet", files_s3, value = TRUE)
(files_s3 <- tail(files_s3, 1))

[1] "s3://digital-atlas/era/geodata/elevation_2025-04-12.parquet"

# Create local file path and download
files_local <- gsub(s3_data_dir, local_data_dir, files_s3)

if(!file.exists(files_local)){
  s3$file_download(files_s3, files_local)
}

[1] "downloaded_data/elevation_2025-04-12.parquet"

elevation<-arrow::read_parquet(files_local)

1.8.4.2 Structure

head(elevation) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Site.Key	Latitude	Longitude	Buffer	Country	variable	stat	value
-0.0023 34.5939 B300	-0.0023	34.5939	300	Kenya	slope	mean	5.043176
-0.0023 34.5939 B300	-0.0023	34.5939	300	Kenya	slope	sd	2.824392
-0.0023 34.5939 B300	-0.0023	34.5939	300	Kenya	slope	median	4.544371
-0.0023 34.5939 B300	-0.0023	34.5939	300	Kenya	slope	max	15.812020
-0.0023 34.5939 B300	-0.0023	34.5939	300	Kenya	slope	min	0.000000
-0.0023 34.5939 B300	-0.0023	34.5939	300	Kenya	aspect	mean	227.078533

Field Descriptions:

Site.Key: Unique identifier for the ERA site.
Latitude: Geographic latitude of the site (decimal degrees, WGS84).
Longitude: Geographic longitude of the site (decimal degrees, WGS84).
Buffer: Radius (in meters) around the site used for calculating topographic statistics.
Country: Country in which the site is located.
variable: The terrain variable being summarized:
- elevation: Elevation above sea level (meters)
- slope: Terrain slope or steepness (degrees)
- aspect: Orientation of slope (degrees clockwise from North)
stat: Summary statistic applied to the variable within the buffer:
- mean: Mean value
- sd: Standard deviation
- median: Median value
- max: Maximum value
- min: Minimum value
value: The computed result for each variable–stat combination.
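
Because the table is in long format (one row per variable and statistic), it usually needs reshaping before it can be merged onto ERA observations by Site.Key. The sketch below uses data.table::dcast() on a small table built from four of the values shown above; the object names elev_toy and elev_wide are illustrative.

```r
library(data.table)

# Small long-format table using values from the preview above
elev_toy <- data.table(
  Site.Key = "-0.0023 34.5939 B300",
  variable = c("slope", "slope", "slope", "aspect"),
  stat = c("mean", "sd", "max", "mean"),
  value = c(5.043176, 2.824392, 15.812020, 227.078533)
)

# One row per site, one column per variable_stat combination
elev_wide <- dcast(elev_toy, Site.Key ~ variable + stat, value.var = "value")
elev_wide
```

The resulting columns are named by joining variable and stat with an underscore, for example slope_mean and aspect_mean.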

1.8.5 SOS

1.8.5.1 Access

files_s3 <- s3$dir_ls(s3_data_dir)

files_s3<-grep("sos_.*RData", files_s3, value = TRUE)
(files_s3 <- tail(files_s3, 1))

[1] "s3://digital-atlas/era/geodata/sos_2025-04-13.RData"

files_local <- gsub(s3_data_dir, local_data_dir, files_s3)

# File size is about 40 MB, so download will take some time depending on your connection
if(!file.exists(files_local)){
    s3$file_download(files_s3, files_local)
}

[1] "downloaded_data/sos_2025-04-13.RData"

# Load sos data
sos<-miceadds::load.Rdata2(file=basename(files_local),path=dirname(files_local))

names(sos)

[1] "Dekadal_SOS"   "Seasonal_SOS2" "LTAvg_SOS2"    "Seasonal_SOS3"
[5] "LTAvg_SOS3"

1.8.5.2 Structure

The start-of-season (SOS) calculations in R/add_geodata/calculate_sos.R process raw climate data to derive growing-season indicators at multiple temporal scales. The script integrates high-resolution (dekadal) climate records with monthly and seasonal aggregations to compute metrics such as the start of season (SOS), end of season (EOS), length of growing period (LGP), and total rainfall. This multi-layered approach is designed to support agricultural planning and climate adaptation analysis.

We do not describe every field in the sos tables here; full descriptions can be found in the metadata table. The table used in the climate statistic calculations is sos$LTAvg_SOS2, which gives the average onset of the rainy season (in dekads) for each location.

metadata

files_s3 <- s3$dir_ls(s3_data_dir)
files_s3<-grep("sos.*metadata.*csv", files_s3, value = TRUE)
(files_s3 <- tail(files_s3, 1))

[1] "s3://digital-atlas/era/geodata/sos_metadata.csv"

# Replace s3 path with https path so we can read directly into R
http_path <- gsub("^s3://([^/]+)/(.*)$", "https://\\1.s3.amazonaws.com/\\2", files_s3)

sos_metadata<-fread(http_path)

head(sos_metadata) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Table	Field	Class	Description
Dekadal_SOS	Site.Key	chr	Unique identifier for the site.
Dekadal_SOS	Year	num	The calendar year corresponding to the record.
Dekadal_SOS	Dekad	num	The 10-day period number within the year (typically 1–36).
Dekadal_SOS	Rain.Season	num	Code indicating the identified rainy season for that dekad.
Dekadal_SOS	Rain.Dekad	num	Total rainfall measured during the dekad.
Dekadal_SOS	AI	num	Aridity Index (ratio of rainfall to potential evapotranspiration) for the dekad.

Dekadal_SOS
- Description: Contains detailed, dekadal (approximately 10-day) climate information.
- Key Metrics: Rainfall, potential evapotranspiration (ETo), aridity index (AI), and computed dekad values.
- Purpose: Provides the high-resolution temporal detail needed to identify seasonal transitions and establish baseline climate conditions.
Seasonal_SOS2
- Description: Aggregates dekadal data into a seasonal view focused on the primary growing season.
- Key Metrics:
  - SOS: The first dekad when conditions indicate the start of the season.
  - EOS: The last dekad of the season.
  - LGP: The length of the growing period (calculated as the difference between EOS and SOS).
  - Tot.Rain: Total rainfall during the season.
- Methodology: Uses rolling sums and fixed threshold criteria (e.g., minimum rainfall, aridity conditions) to define the rainy period, with padding applied to manage edge effects.

LTAvg_SOS2
- Description: Provides long-term average statistics derived from the primary seasonal data.
- Key Metrics:
  - Mean, median, minimum, and maximum SOS values.
  - Average EOS, LGP, and average total seasonal rainfall (mean seasonal precipitation, MSP).
  - Proportions indicating seasonal transitions across calendar years.
- Purpose: Summarizes seasonal behavior over the full record, highlighting variability and central tendencies.

head(sos$LTAvg_SOS2[,.(Site.Key,SOS.min,SOS.mean,SOS.max,EOS,LGP,Tot.Rain,Seasons)]) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Site.Key	SOS.min	SOS.mean	SOS.max	EOS	LGP	Tot.Rain	Seasons
-0.0023 34.5939 B300	4	5.6	9	18	12.4	859.1	2
-0.0023 34.5939 B300	22	22.6	25	36	13.4	751.4	2
-0.0108 36.9617 B250	7	9.1	12	17	7.9	309.5	2
-0.0108 36.9617 B250	28	29.3	32	36	6.7	236.6	2
-0.0333 34.8000 B917	4	5.5	9	18	12.5	817.3	2
-0.0333 34.8000 B917	22	22.5	25	36	13.5	690.4	2
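The relationship between the seasonal and long-term tables can be sketched as a simple group-wise summary. This is a minimal illustration with invented values, not the code in calculate_sos.R; the Season column and the summarise_season helper are hypothetical names introduced for the example.

```r
# Toy seasonal table: one row per site, year and season (invented values)
seasonal <- data.frame(
  Site.Key = rep("site_A", 6),
  Year     = rep(2001:2003, each = 2),
  Season   = rep(1:2, times = 3),          # hypothetical season index
  SOS      = c(5, 22, 6, 23, 4, 24),
  EOS      = c(18, 36, 18, 35, 17, 36),
  Tot.Rain = c(850, 740, 900, 700, 820, 760)
)
# LGP as described above: difference between EOS and SOS (in dekads)
seasonal$LGP <- seasonal$EOS - seasonal$SOS

# Summarise one site-season group across years
summarise_season <- function(d) {
  data.frame(
    Site.Key = d$Site.Key[1],
    Season   = d$Season[1],
    SOS.min  = min(d$SOS),
    SOS.mean = mean(d$SOS),
    SOS.max  = max(d$SOS),
    EOS      = mean(d$EOS),
    LGP      = mean(d$LGP),
    Tot.Rain = mean(d$Tot.Rain)
  )
}

groups <- split(seasonal, list(seasonal$Site.Key, seasonal$Season), drop = TRUE)
lt_avg <- do.call(rbind, lapply(groups, summarise_season))
rownames(lt_avg) <- NULL
print(lt_avg)
```

A plain mean of dekad numbers is only safe when a season does not straddle the calendar year boundary; seasons that wrap from dekad 36 to dekad 1 would need to be re-indexed before averaging.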

Seasonal_SOS3 & LTAvg_SOS3
- Description: These datasets mirror Seasonal_SOS2 and LTAvg_SOS2 but pertain to an additional (often secondary) growing season, which may be present in more humid regions.
- Key Metrics & Purpose: Similar to the primary season outputs, these capture SOS, EOS, LGP, and rainfall for the secondary season, adding nuance to regions where multiple growing periods exist.

1.8.5.3 Methods

/R/add_geodata/calculate_sos.R

Data Integration: The script merges datasets from the POWER and CHIRPS sources—substituting CHIRPS rainfall into the POWER dataset—to leverage the strengths of both and ensure more accurate rainfall data.
Temporal Aggregation: Daily data are first converted to dekadal values. These are further aggregated into monthly summaries and then into seasonal periods using custom functions (e.g., SOS_Dekad, SOS_SeasonPad).
Threshold-Based Filtering: Fixed criteria (e.g., a minimum rainfall threshold of 200 mm, an aridity index cutoff) are applied to delineate season boundaries. While these thresholds are clearly defined, they prompt a critical question: are they universally applicable, or do they require recalibration for different regions and evolving climate conditions?
Handling Season Transitions: The script manages scenarios where seasons cross calendar boundaries, excludes incomplete years, and applies specific padding rules to balance season lengths. Custom sequence functions (e.g., SOS_UniqueSeq, SOS_SeqMerge) play a key role in ensuring the integrity of the seasonal identification.
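
The aggregation and threshold steps listed above can be illustrated with a minimal sketch. It does not reproduce the logic of calculate_sos.R: the function names date_to_dekad and find_onset, the aridity index cut-off of 0.5, and the three-dekad run length are assumptions chosen for illustration, and the daily values are invented.

```r
# Map a date to its dekad of the year (1-36).
# Assumption: three dekads per month; day 21 to month-end is the third dekad.
date_to_dekad <- function(date) {
  lt <- as.POSIXlt(date)
  lt$mon * 3 + pmin((lt$mday - 1) %/% 10, 2) + 1
}

# Toy daily series for one site and one year (wet from April to June)
days  <- seq(as.Date("2020-01-01"), as.Date("2020-12-31"), by = "day")
daily <- data.frame(
  DATE = days,
  RAIN = ifelse(format(days, "%m") %in% c("04", "05", "06"), 6, 0.5),
  ETO  = 4
)
daily$Dekad <- date_to_dekad(daily$DATE)

# Dekadal totals and aridity index (rainfall / potential evapotranspiration)
dekadal <- aggregate(cbind(RAIN, ETO) ~ Dekad, data = daily, FUN = sum)
dekadal$AI <- dekadal$RAIN / dekadal$ETO

# Hypothetical onset rule: the first dekad that starts a run of `run_len`
# consecutive dekads with AI >= `ai_min`. Assumes `dekad` is sorted and gap-free.
find_onset <- function(dekad, ai, ai_min = 0.5, run_len = 3) {
  wet <- ai >= ai_min
  n_starts <- length(wet) - run_len + 1
  if (n_starts < 1) return(NA_real_)
  for (i in seq_len(n_starts)) {
    if (all(wet[i:(i + run_len - 1)])) return(dekad[i])
  }
  NA_real_
}

sos_dekad <- find_onset(dekadal$Dekad, dekadal$AI)
print(sos_dekad)
```

Requiring several consecutive wet dekads is one simple way to avoid flagging an isolated early rain event as the onset; the actual script may use different criteria, such as the rolling sums and minimum rainfall threshold mentioned above.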

Critical Considerations & Forward-Thinking Perspective

Fixed Parameters vs. Regional Flexibility: The use of fixed thresholds (for rainfall and aridity) is straightforward but invites scrutiny. It is important to ask whether these parameters are optimal for all regions, especially in a changing climate. Future iterations might consider adaptive or region-specific thresholds.
Modular and Adaptable Structure: The script’s modular design—with separate outputs for dekadal details, seasonal summaries, and long-term averages—allows for flexibility. This structure facilitates updates and refinements, such as integrating more dynamic statistical methods or machine learning approaches to adjust thresholds.
Robustness in Data Quality: Steps to exclude incomplete data and adjust for season transitions add robustness, but constant validation against observed ground conditions is critical for long-term reliability.

The output datasets provide a comprehensive picture of growing season dynamics:
- Dekadal_SOS offers fine-scale temporal resolution.
- Seasonal_SOS2 and LTAvg_SOS2 capture the characteristics of the primary and secondary growing seasons.
- Seasonal_SOS3 and LTAvg_SOS3 extend this analysis to a potential third season.

This structure supports detailed analysis and decision-making in agricultural and climate adaptation planning. The fixed thresholds should nonetheless be questioned, and the approach validated against observed data, so that the outputs remain relevant as climate conditions change.

1.8.6 AEZ

1.8.6.1 Access

files_s3 <- s3$dir_ls(s3_data_dir)

files_s3<-grep("aez_.*parquet", files_s3, value = TRUE)
(files_s3 <- tail(files_s3, 1))

[1] "s3://digital-atlas/era/geodata/aez_2025-04-11.parquet"

files_local <- gsub(s3_data_dir, local_data_dir, files_s3)

if(!file.exists(files_local)){
    s3$file_download(files_s3, files_local)
}

[1] "downloaded_data/aez_2025-04-11.parquet"

aez<-arrow::read_parquet(files_local)

1.8.6.2 Structure

head(aez) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Latitude	Longitude	Site.Key	Buffer	prop	value	dataset	value_cat
-0.0023	34.5939	-0.0023 34.5939 B300	300	1.00	324	004_afr-aez_09.tif	Tropic - cool / humid
-0.0108	36.9617	-0.0108 36.9617 B250	250	1.00	323	004_afr-aez_09.tif	Tropic - cool / subhumid
-0.0333	34.8000	-0.0333 34.8000 B917	917	1.00	324	004_afr-aez_09.tif	Tropic - cool / humid
-0.0333	37.8333	-0.0333 37.8333 B917	917	1.00	313	004_afr-aez_09.tif	Tropic - warm / subhumid
-0.0420	34.5920	-0.0420 34.5920 B12500	12500	0.89	324	004_afr-aez_09.tif	Tropic - cool / humid
-0.0620	34.2290	-0.0620 34.2290 B30000	30000	0.55	314	004_afr-aez_09.tif	Tropic - warm / humid

Field Descriptions:

Latitude: Geographic latitude of the site (decimal degrees, WGS84).
Longitude: Geographic longitude of the site (decimal degrees, WGS84).
Site.Key: Unique site identifier used throughout ERA.
Buffer: Radius (in meters) used to extract AEZ values from raster data.
prop: Proportion of the buffer area covered by the dominant AEZ category.
value: Numeric AEZ class code assigned by the source dataset.
dataset: The AEZ raster dataset used, e.g., "004_afr-aez_09.tif" or "AEZ8_CLAS--SSA.tif".
value_cat: Human-readable label for the AEZ zone, derived from the class value using an external key or metadata file (e.g., "Tropic - cool / humid").
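
As an illustration of how the prop and value_cat fields might be used, the sketch below flags sites whose dominant zone covers only part of the buffer and splits the zone label into two components. The 0.75 cut-off and the new column names (aez_ambiguous, thermal, moisture) are assumptions for the example; the rows are copied from the output above.

```r
# Stand-in for the aez table (three rows from the example output)
aez_demo <- data.frame(
  Site.Key  = c("-0.0023 34.5939 B300",
                "-0.0420 34.5920 B12500",
                "-0.0620 34.2290 B30000"),
  Buffer    = c(300, 12500, 30000),
  prop      = c(1.00, 0.89, 0.55),
  value_cat = c("Tropic - cool / humid",
                "Tropic - cool / humid",
                "Tropic - warm / humid"),
  stringsAsFactors = FALSE
)

# Hypothetical cut-off: treat the AEZ label as ambiguous when the dominant
# class covers less than 75% of the buffer area
min_prop <- 0.75
aez_demo$aez_ambiguous <- aez_demo$prop < min_prop

# The labels shown above combine two parts separated by " / ";
# split them so sites can be grouped by either part
parts <- strsplit(aez_demo$value_cat, " / ", fixed = TRUE)
aez_demo$thermal  <- vapply(parts, `[`, character(1), 1)
aez_demo$moisture <- vapply(parts, `[`, character(1), 2)

print(aez_demo[, c("Site.Key", "prop", "aez_ambiguous", "thermal", "moisture")])
```

Larger buffers are more likely to span several zones, so a prop-based flag tends to pick out sites with imprecise locations.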

1.8.7 Soil Moisture

Daily soil moisture balance is calculated using a simple water balance model implemented in water_balance.R.
This model simulates daily soil water availability, evaporative demand, and waterlogging risk for each site, based on rainfall (CHIRPS), temperature and radiation (NASA POWER), and soil properties derived from ISDA (Africa) or SoilGrids 2.0 (non-Africa). These daily values are the foundation for seasonal summaries of water stress (ERATIO) and excess moisture (LOGGING).

ERATIO (Evaporative Ratio) is the ratio of actual evapotranspiration (Ea) to potential evapotranspiration (Ep).
- Values near 1 indicate sufficient water availability—plants are able to meet atmospheric demand.
- Values < 0.5 suggest moderate to severe water stress, where crop water needs are not being met.
- Daily ERATIO values are used to summarize frequency, duration, and intensity of drought conditions across a season.
LOGGING represents the amount of water in the soil above field capacity but below saturation.
- Positive values indicate periods where excess water may restrict oxygen availability to roots (i.e., waterlogging).
- Used to flag moisture stress due to excess rainfall or poor drainage.
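
A minimal sketch of how daily ERATIO and LOGGING values could be condensed into frequency, duration, and intensity measures is shown below. The values are invented and the variable names are illustrative; only the 0.5 stress threshold comes from the description above, and the summaries actually produced by the ERA pipeline may be defined differently.

```r
# Toy daily series for one site and season (invented values)
eratio  <- c(0.9, 0.8, 0.4, 0.3, 0.2, 0.7, 0.45, 0.4, 0.95, 1.0)
logging <- c(0,   0,   0,   0,   0,   0,   0,    0,   3.2,  1.1)

stress_threshold <- 0.5   # values below this suggest water stress
stressed <- eratio < stress_threshold

# Frequency: number of stressed days
n_stress_days <- sum(stressed)

# Duration: longest run of consecutive stressed days
runs <- rle(stressed)
longest_spell <- if (any(runs$values)) max(runs$lengths[runs$values]) else 0L

# Intensity: mean shortfall below the threshold on stressed days
mean_deficit <- if (n_stress_days > 0) {
  mean(stress_threshold - eratio[stressed])
} else {
  0
}

# Waterlogging: days with water above field capacity, and the seasonal total (mm)
n_logging_days <- sum(logging > 0)
total_logging  <- sum(logging)

print(c(n_stress_days = n_stress_days, longest_spell = longest_spell,
        mean_deficit = mean_deficit, n_logging_days = n_logging_days,
        total_logging = total_logging))
```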

1.8.7.1 Access

files_s3 <- s3$dir_ls(s3_data_dir)

# Substitute "isda" for "soilgrids" in the pattern below to access the isda-based data (for African sites)
# Here we are using soilgrids2.0 because the file size is more convenient for this vignette
files_s3<-grep("watbal.*soilgrids.*parquet", files_s3, value = TRUE)
(files_s3 <- tail(files_s3, 1))

[1] "s3://digital-atlas/era/geodata/watbal-soilgrids2.0_2025-04-13.parquet"

files_local <- gsub(s3_data_dir, local_data_dir, files_s3)

# File size is about x MB, so download will take some time depending on your connection
if(!file.exists(files_local)){
    s3$file_download(files_s3, files_local)
}

[1] "downloaded_data/watbal-soilgrids2.0_2025-04-13.parquet"

watbal<-arrow::read_parquet(files_local)

1.8.7.2 Structure

head(watbal) |>
  kable(format = "html") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"), position = "left") |>
  scroll_box(width = "100%", height = "250px")

Site.Key	scp	ssat	DATE	TMIN	TMAX	TMEAN	SRAD	ETMAX	DEMAND	ERATIO
-1.1720 -80.3920 B400	32.29	2.65	1984-01-01	22.5	30.3	25.4	19.1	6.62	0.09	0.01
-1.1720 -80.3920 B400	32.29	2.65	1984-01-02	22.2	30.4	25.5	15.6	5.49	0.07	0.01
-1.1720 -80.3920 B400	32.29	2.65	1984-01-03	22.6	29.2	25.0	17.6	5.85	0.08	0.01
-1.1720 -80.3920 B400	32.29	2.65	1984-01-04	22.6	27.6	24.5	18.4	5.74	0.08	0.01
-1.1720 -80.3920 B400	32.29	2.65	1984-01-05	22.4	27.7	24.4	17.1	5.37	0.07	0.01
-1.1720 -80.3920 B400	32.29	2.65	1984-01-06	22.5	28.4	24.7	16.9	5.42	0.07	0.01

Each row represents a unique site-day combination.

Field Descriptions:

Site.Key: Unique identifier for the ERA site.
scp: Soil water holding capacity at field capacity (mm). Estimated from ISDA or SoilGrids based on pedotransfer rules.
ssat: Soil saturation point (mm). Maximum amount of water the soil can hold.
DATE: Observation date (daily time step).
TMIN: Minimum daily air temperature (°C), from NASA POWER.
TMAX: Maximum daily air temperature (°C), from NASA POWER.
TMEAN: Mean daily air temperature (°C).
RAIN: Daily precipitation (mm), from CHIRPS.
SRAD: Surface solar radiation (MJ/m²/day), from NASA POWER.
ETMAX: Potential evapotranspiration (PET, mm/day), calculated using the Priestley–Taylor method.
AVAIL: Estimated soil water available to crops (mm). Simulated daily from soil and rainfall inputs.
DEMAND: Crop water demand (mm). Equal to ETMAX if water is not limiting.
ERATIO: Evaporative ratio (Ea/Ep) — actual evapotranspiration divided by PET. A proxy for crop water stress.
LOGGING: Simulated waterlogging value (mm above field capacity, but below saturation).
RUNOFF: Excess rainfall (mm) beyond soil saturation capacity; assumed to be lost as runoff or deep percolation.
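
To show how these daily fields relate to a growing season, the sketch below subsets a toy daily table to a window defined by a planting date and season length, then computes a few summaries. The table, planting date, season length, and output column names are all invented for illustration; the ERA climate statistics are generated by the pipeline's own scripts and may differ in detail.

```r
# Toy daily table with a subset of the watbal fields (invented values)
watbal_demo <- data.frame(
  Site.Key = "site_A",
  DATE     = seq(as.Date("2020-03-01"), by = "day", length.out = 120),
  RAIN     = rep(c(0, 5, 10, 0), times = 30),
  TMEAN    = 25,
  ERATIO   = rep(c(0.3, 0.8, 1.0, 0.6), times = 30)
)

# Hypothetical planting date and season length (days) for one observation
planting_date <- as.Date("2020-03-15")
season_length <- 90

# Keep the days from planting up to (but not including) planting + season length
in_season <- watbal_demo$DATE >= planting_date &
  watbal_demo$DATE < planting_date + season_length
season <- watbal_demo[in_season, ]

season_stats <- data.frame(
  Site.Key    = season$Site.Key[1],
  n_days      = nrow(season),
  total_rain  = sum(season$RAIN),          # mm
  mean_tmean  = mean(season$TMEAN),        # degrees C
  stress_days = sum(season$ERATIO < 0.5)   # days with ERATIO below 0.5
)
print(season_stats)
```

For the full watbal table the same filter would be applied per Site.Key, using the planting date and season length attached to each ERA observation.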

1.8.8 Bioclim

This will be made available in future updates. If you have a critical need for this information then please contact the ERA team and we can prioritize these data.

2 Acknowledgements

These open-source scripts were delivered for and funded by the Agroecology in the Dry Corridor of Central America (ACDC) project.

3 Contact Us

For more details or to explore collaborative opportunities:

Please visit our GitHub repository: https://github.com/ERAgriculture/ERA_Agronomy.git

Or contact: Peter Steward (Scientist II): p.steward@cgiar.org
Namita Joshi (Senior Research Associate): n.joshi@cgiar.org
Todd Rosenstock (Principal Scientist): t.rosenstock@cgiar.org
