Ingredient classification

1. Purpose

The goal of this script is to develop and document a suite of deterministic rule based algorithms that assign individual feed ingredients to one of six nutritional categories (I – VI) on the basis of routine laboratory compositional data. Each algorithm embodies a distinct set of decision rules.

2. Baseline classification

Variable	Category I (High‑starch)	Category II (High‑fat)	Category III (Protein)	Category IV (Good forage)	Category V (Poor forage)	Category VI (Mineral/Vit)
GE (MJ kg⁻¹ DM)	170 – 190	300 – 380	180 – 200	160 – 180	150 – 170	—
Ash (g kg⁻¹ DM)	20 – 40	0 – 10	50 – 80	70 – 100	60 – 100	≥ 500
CP (g kg⁻¹ DM)	80 –120	0 – 50	350 – 500	120 – 200	30 – 80	≈ 0
EE (g kg⁻¹ DM)	20 – 50	800 – 990	20 – 60	10 – 30	0 – 20	< 1
NDF (g kg⁻¹ DM)	100 – 200	0 – 50	100 – 300	400 – 500	600 – 750	—

3. Classification methods

Method	R function	Core concept	Variables needed	One‑sentence rule
1	`method1()`	Exact‑window filter	GE, Ash, CP, EE, NDF (all present)	Assign to the first category in which every variable falls inside its strict baseline window.
2	`method2()`	Exact ± 10 % filter	Same five	Same as M1 but each numeric bound is widened ± 10 %.
3	`method3()`	Priority‑variable filter (strict)	Only the priority variables per category: • I → NDF • II → EE • III–V → CP & NDF • VI → Ash	If all priority variables exist and meet their strict windows, assign that category; missing non‑priority data are ignored.
4	`method4()`	Priority‑variable ± 10 %	Same	Priority windows widened ± 10 %; non‑priority values (if present) still use strict windows and can only disqualify.
5	`method5()`	Hierarchical decision tree (strict)	Any subset of GE, Ash, CP, EE, NDF	Gates tested in order: `EE ≥ 80 → II` → else `Ash ≥ 50 → VI` → else `CP ≥ 35 & NDF ≤ 30 → III` → else `NDF ≤ 20 → I` → else `NDF ≥ 60 → V` → else `40 ≤ NDF ≤ 50 → IV` → else unclassified.
6	`method6()`	Hierarchical tree ± 10 %	Same	Same gates as M5 but every numeric constant is multiplied by 0.9 (lower) and 1.1 (upper) before testing.
7	`method7()`	Hierarchical ± 10 % plus text overrides	Same numerics + `D.Item`, `D.Type`	Start with Method 6 result, then override: • `oil\|tallow\|fat\|grease` in D.Item → II • protein‑meal regex (distillers grain, soybean meal, …) → III • `D.Type` ∈ “Supplement” or matches “supplement” → VI

All concentration variables are in g·kg⁻¹ DM except GE (MJ·kg⁻¹ DM).

4. Hierachical decision tree

library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
library(readr)
library(utils)
library(DT)

Ing_classif <- read_csv("Ing_classif.csv")

Ing_classif%>%
  filter(is_entire_diet=="FALSE")

## # A tibble: 394,966 × 11
##     ...1 B.Code   D.Item      is_entire_diet DC.Variable DC.Value DC.Unit Source
##    <dbl> <chr>    <chr>       <lgl>          <chr>          <dbl> <chr>   <chr> 
##  1     1 NN0137.2 0 Finisher  FALSE          DM                NA g/kg    lives…
##  2     2 NN0137.2 0 Grower    FALSE          DM                NA g/kg    lives…
##  3    24 NN0396   0% finisher FALSE          DM                NA g/kg    lives…
##  4    25 NN0396   0% starter  FALSE          DM                NA g/kg    lives…
##  5    59 NN0396   100% finis… FALSE          DM                NA g/kg    lives…
##  6    60 NN0396   100% start… FALSE          DM                NA g/kg    lives…
##  7    78 DK0001   16-20 wk S… FALSE          DM                NA g/kg    lives…
##  8    79 DK0001   16-24 Fora… FALSE          DM                NA g/kg    lives…
##  9    88 NN0137.2 2 Finisher  FALSE          DM                NA g/kg    lives…
## 10    89 NN0137.2 2 Grower    FALSE          DM                NA g/kg    lives…
## # ℹ 394,956 more rows
## # ℹ 3 more variables: Feedipedia <dbl>, nutrition_source <chr>, ilri_code <chr>

diets <- read_csv("diets.csv")

Ing_classif <- Ing_classif %>% 
  filter(!D.Item %in% c("Salt", "Water"))

## ───────────────────────────────────────────────────────────────────────
##  0 ·  Packages
## ──────────────────────────────────────────────────────────────────────


## ───────────────────────────────────────────────────────────────────────
##  1 ·  Build a lookup of ingredient “type” from the diets table
## ───────────────────────────────────────────────────────────────────────
type_lookup <- diets %>%                               # your diets data frame
  select(B.Code, D.Item, D.Type) %>% 
  distinct()

## ───────────────────────────────────────────────────────────────────────
##  2 ·  Pivot Ing_classif to wide  (one row = one ingredient)
##       and join type information
## ───────────────────────────────────────────────────────────────────────
nutr_wide <- Ing_classif %>% 
  filter(is_entire_diet=="FALSE")%>%
  filter(DC.Variable %in% c("CP","EE","NDF","Ash","GE")) %>% 
  mutate(DC.Value = as.numeric(DC.Value)) %>% 
  pivot_wider(
    id_cols      = c(B.Code, D.Item),
    names_from   = DC.Variable,
    values_from  = DC.Value,
    values_fn    = list(DC.Value = mean),
    values_fill  = NA
  ) %>% 
  left_join(type_lookup, by = c("B.Code", "D.Item"))

## ───────────────────────────────────────────────────────────────────────
##  3 ·  Numeric windows  (g kg‑1 DM except GE in MJ kg‑1 DM)
## ───────────────────────────────────────────────────────────────────────
win <- list(
  GE  = list(I = c(170, 190), II = c(300, 380), III = c(180, 200),
             IV = c(160, 180), V = c(150, 170)),
  Ash = list(I = c( 20,  40), II = c(  0,  10), III = c( 50,  80),
             IV = c( 70, 100), V = c( 60, 100), VI = c(500, 1000)),
  CP  = list(I = c( 80, 120), II = c(  0,  50), III = c(350, 500),
             IV = c(120, 200), V = c( 30,  80)),
  EE  = list(I = c( 20,  50), II = c(800, 990), III = c( 20,  60),
             IV = c( 10,  30), V = c(  0,  20)),
  NDF = list(I = c(100, 200), II = c(  0,  50), III = c(100, 300),
             IV = c(400, 500), V = c(600, 750))
)
expand10 <- function(x) c(x[1]*0.9, x[2]*1.1)

## ───────────────────────────────────────────────────────────────────────
##  4 ·  Priority variable lists  (Methods 3 & 4)
## ───────────────────────────────────────────────────────────────────────
priority <- list(
  I  = c("NDF"),
  II = c("EE"),
  III= c("CP", "NDF"),
  IV = c("NDF", "CP"),
  V  = c("NDF", "CP"),
  VI = c("Ash")
)

## ───────────────────────────────────────────────────────────────────────
##  5 ·  Regex patterns for overrides (Method 7)
## ───────────────────────────────────────────────────────────────────────
regex_name <- list(
  II = "\\b(oil|tallow|fat|grease)\\b",                                             # Cat II
  III= "\\b(distillers?.*grain|soybean meal|canola meal|fish meal|blood meal)\\b"   # Cat III
)
vi_types   <- c("Supplement", "Non-ERA Feed")                                       # Cat VI
vi_regex   <- regex("supplement", ignore_case = TRUE)

## ───────────────────────────────────────────────────────────────────────
##  6 ·  Helper utilities
## ───────────────────────────────────────────────────────────────────────
inside <- function(x, rng) !is.na(x) & x >= rng[1] & x <= rng[2]

## ───────────────────────────────────────────────────────────────────────
##  7 ·  Seven classification methods (no NFC)
## ───────────────────────────────────────────────────────────────────────
method1 <- function(df, w = win) {
  df %>% mutate(
    M1_Category = case_when(
      inside(GE,  w$GE$I)  & inside(Ash, w$Ash$I)  &
      inside(CP,  w$CP$I)  & inside(EE,  w$EE$I)   &
      inside(NDF, w$NDF$I)                         ~ "I",
      inside(GE,  w$GE$II) & inside(Ash, w$Ash$II) &
      inside(CP,  w$CP$II) & inside(EE,  w$EE$II)  &
      inside(NDF, w$NDF$II)                        ~ "II",
      inside(GE,  w$GE$III)& inside(Ash, w$Ash$III)&
      inside(CP,  w$CP$III)& inside(EE,  w$EE$III) &
      inside(NDF, w$NDF$III)                       ~ "III",
      inside(GE,  w$GE$IV) & inside(Ash, w$Ash$IV) &
      inside(CP,  w$CP$IV) & inside(EE,  w$EE$IV)  &
      inside(NDF, w$NDF$IV)                        ~ "IV",
      inside(GE,  w$GE$V)  & inside(Ash, w$Ash$V)  &
      inside(CP,  w$CP$V)  & inside(EE,  w$EE$V)   &
      inside(NDF, w$NDF$V)                         ~ "V",
      inside(Ash, w$Ash$VI)                        ~ "VI",
      TRUE                                         ~ NA_character_
    ))
}
method2 <- function(df) {
  w10 <- map(win, ~ map(.x, expand10))
  method1(df, w10) %>% rename(M2_Category = M1_Category)
}

method_priority <- function(df, w, out_name){
  df %>% rowwise() %>% mutate(
    "{out_name}" := {
      dat <- c(CP=CP, EE=EE, NDF=NDF, Ash=Ash, GE=GE)
      cat <- NA_character_
      for(cl in names(priority)) {
        req <- priority[[cl]]
        if(all(!is.na(dat[req])) &&
           all(map2_lgl(dat[req], req, \(v,r) inside(v, w[[r]][[cl]]))))
        { cat <- cl; break }
      }
      cat
    }) %>% ungroup()
}
method3 <- function(df) method_priority(df, win,  "M3_Category")
method4 <- function(df){
  w10 <- map(win, ~ map(.x, expand10))
  method_priority(df, w10, "M4_Category")
}

tree_decide <- function(df, w){
  df %>% mutate(
    Category = case_when(
      EE  >= w$EE$II [1]               ~ "II",
      Ash >= w$Ash$VI[1]               ~ "VI",
      CP  >= w$CP$III[1] & NDF <= w$NDF$III[2] ~ "III",
      NDF <= w$NDF$I[2]                ~ "I",
      NDF >= w$NDF$V[1]                ~ "V",
      NDF >= w$NDF$IV[1] & NDF <= w$NDF$IV[2]  ~ "IV",
      TRUE                             ~ NA_character_
    )
  )
}
method5 <- function(df) tree_decide(df, win) %>% rename(M5_Category = Category)
method6 <- function(df){
  w10 <- map(win, ~ map(.x, expand10))
  tree_decide(df, w10) %>% rename(M6_Category = Category)
}

method7 <- function(df) {

  # ------------------------------------------------------------------
  # 1 · Ensure we have M6_Category
  # ------------------------------------------------------------------
  out <- if ("M6_Category" %in% names(df)) {
           df                       # already present → keep as‑is
         } else {
           method6(df)              # not present     → run method6()
         }

  # ------------------------------------------------------------------
  # 2 · Start M7 from M6
  # ------------------------------------------------------------------
  out$M7_Category <- out$M6_Category

  # ------------------------------------------------------------------
  # 3 · Name‑based overrides  (Cat II oils, Cat III protein meals)
  # ------------------------------------------------------------------
  oil_pat   <- "\\b(oil|tallow|fat|grease)\\b"
  protein_pat <- "\\b(distillers?.*grain|soybean meal|canola meal|fish meal|blood meal)\\b"

  out$M7_Category[str_detect(out$D.Item, regex(oil_pat, ignore_case = TRUE))]   <- "II"
  out$M7_Category[str_detect(out$D.Item, regex(protein_pat, ignore_case = TRUE))] <- "III"

  # ------------------------------------------------------------------
  # 4 · Type‑based override for supplements / minerals (→ Cat VI)
  # ------------------------------------------------------------------
  sup_types <- c("Supplement", "Non-ERA Feed")
  vi_hit <- !is.na(out$D.Type) &
            (out$D.Type %in% sup_types |
             str_detect(out$D.Type, regex("supplement", ignore_case = TRUE)))

  out$M7_Category[vi_hit] <- "VI"

  out
}


## ───────────────────────────────────────────────────────────────────────
##  8 ·  Run all methods and inspect
## ───────────────────────────────────────────────────────────────────────
classif_all <- nutr_wide %>% 
  method1() %>% 
  method2() %>% 
  method3() %>% 
  method4() %>% 
  method5() %>% 
  method6() %>% 
  method7()

datatable(
  classif_all %>% arrange(M7_Category),
  caption  = "Classification of ingredients (sorted by M7 category)",
  rownames = FALSE,
  options  = list(pageLength = 30, dom = "tip")
)

## 1 · list of method‑columns
method_cols <- grep("^M[1-7]_Category$", names(classif_all), value = TRUE)

## 2 · Coverage table  (one row per method)
coverage_tbl <- map_dfr(method_cols, function(mc){
  tibble(
    Method    = sub("_Category$", "", mc),
    Total     = nrow(classif_all),
    Assigned  = sum(!is.na(classif_all[[mc]])),
    Coverage  = round(Assigned / Total * 100, 1)  # %
  )
})

## 3 · Counts per category
counts_tbl <- classif_all %>% 
  pivot_longer(all_of(method_cols),
               names_to = "Method",
               values_to = "Category") %>% 
  filter(!is.na(Category)) %>% 
  mutate(Method = sub("_Category$", "", Method)) %>% 
  count(Method, Category, name = "n") %>% 
  arrange(Method, desc(n))

## 4 · Ten random ingredients per category
set.seed(123)                       # for reproducibility
sample_tbl <- classif_all %>% 
  pivot_longer(all_of(method_cols),
               names_to = "Method",
               values_to = "Category") %>% 
  filter(!is.na(Category)) %>% 
  mutate(Method = sub("_Category$", "", Method)) %>% 
  group_by(Method, Category) %>% 
  slice_sample(n = 10, replace = FALSE) %>% 
  summarise(Examples = paste(D.Item, collapse = "; "), .groups = "drop") %>% 
  arrange(Method, Category)


## ───────────────────────────────────────────────────────────────────────
##  6 ·  Top‑10 “Unclassified” ingredients per method
## ───────────────────────────────────────────────────────────────────────
unclassified_tbl <- classif_all %>% 
  pivot_longer(all_of(method_cols),
               names_to  = "Method",
               values_to = "Category") %>% 
  filter(is.na(Category)) %>%          # keep only the NAs
  mutate(Method = sub("_Category$", "", Method)) %>% 
  count(Method, D.Item, sort = TRUE) %>%   # how many records of each item
  group_by(Method) %>% 
  slice_max(n, n = 10, with_ties = FALSE) %>%   # top‑10 per method
  ungroup()

## ───────────────
## 5 · Inspect – interactive view with DT
## ───────────────
library(DT)

# Coverage by method
datatable(
  coverage_tbl,
  caption = "Coverage (number and % of ingredients classified)",
  rownames = FALSE,
  options = list(pageLength = 10, dom = "tip")
)

# Counts per category (long format)
datatable(
  counts_tbl,
  caption = "Counts of ingredients in each category per method",
  rownames = FALSE,
  options = list(pageLength = 25, dom = "tip")
)

# Ten random examples per (Method, Category)
datatable(
  sample_tbl,
  caption = "Random sample (≤10) of ingredients in every category",
  rownames = FALSE,
  options = list(pageLength = 25, dom = "tip")
)

# Top‑10 most frequent unclassified ingredients per method
datatable(
  unclassified_tbl,
  caption = "Most frequent unclassified ingredients (top‑10 per method)",
  rownames = FALSE,
  options = list(pageLength = 25, dom = "tip")
)

Ingredient classification

2025-07-22

1. Purpose

2. Baseline classification

3. Classification methods

4. Hierachical decision tree