This vignette provides a reproducible example of how to harness OpenAlex for generating comprehensive search terms and evaluating the resulting references against established databases such as Web of Science (WoS) and Scopus. In this vignette, we will:

  1. Introduce OpenAlex: Review its features and explain its role in modern scholarly research.

  2. Generate Search Strings: Develop robust and reproducible search strategies tailored to your research needs. For this vignette, we will use the example of agroecology in Latin America and the Caribbean.

  3. Querying OpenAlex: Demonstrate how to interact with the OpenAlex API to retrieve relevant references.

  4. Testing Results Against WoS: Compare the OpenAlex results with Web of Science records by matching DOIs.

Introduction to OpenAlex

OpenAlex is an open source platform that offers comprehensive, freely accessible information about research publications, authors, and institutions. Introduced in January 2022 by OurResearch, OpenAlex was developed as a successor to the now-discontinued Microsoft Academic Graph (MAG). Although it does not encompass every feature of MAG—for instance, patent data is not included—OpenAlex significantly expands on its predecessor’s scope (Priem et al., 2022).

A key strength of OpenAlex is its open licensing model, which ensures that all data, code, and related tools are publicly available. This commitment to openness promotes transparency, reproducibility, and innovation in research practices (Priem et al., 2022). As a result, many academic institutions are increasingly turning to OpenAlex as a viable alternative to proprietary bibliometric tools. For example, Sorbonne University has recently transitioned from using traditional resources such as Web of Science to leveraging OpenAlex and other open-source solutions, in line with its broader policy of openness and sustainable research practices (Scheidsteger & Haunschild, 2022; Culbert et al., 2024).

Furthermore, OpenAlex incorporates advanced artificial intelligence techniques, including natural language processing and machine learning, to enhance the quality and relevance of its metadata (Priem et al., 2022). In some instances, these AI-driven methods leverage large language models to support tasks such as automatic classification, entity disambiguation, and semantic search. Such applications of AI help ensure that the platform remains a cutting-edge resource for research.

Generate Search Strings

This code is designed to construct a structured search query for OpenAlex by defining multiple categories of search terms and formatting them into a query string that follows the OpenAlex API syntax.

1. Load Required Packages

The code first ensures that pacman is installed, then uses it to install (if needed) and load the required packages (openalexR, data.table, readxl, dplyr, ggplot2, and plotly).

if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}
pacman::p_load(
  openalexR, data.table, readxl, dplyr, ggplot2, plotly
)

2. Define Search Term Categories:

The code creates four vectors containing terms related to different aspects of the search:

outcomes_terms: Terms related to impacts, benefits, and economic analysis.

practices_terms: Terms related to different agroecological practices.

product_terms: Terms referring to specific crops or livestock.

region_terms: Terms referring to specific geographical locations.

# Define your search term vectors
outcomes_terms <- c(
  "Benefit-cost analysis", "Break-even period", "Economic valuation",
  "Economic impact", "Net present value", "Payback period", "Willingness to pay",
  "Gross margin","Kilograms per hectare","Cost",
  "Climate change", "Risk assessment", "Vulnerability assessment",
  "Adaptive management", "Community awareness", "Drought resistance","Indigenous knowledge",
  "Changing climate","Flood resistance", "Soil fertility", "Soil degradation", 
  "Biodiversity","Soil degradation","Yield loss cost","Revenue","Yield",
  "Food security", "Water availability", "Water stress"
)

practices_terms <- c(
  "Agroforestry", "Evergreen agriculture", "Farmer managed natural regeneration",
  "Silvopastoral systems", "Alley cropping", "Crop rotation", "Cover cropping",
  "Integrated pest management", "No tillage","Minimum tillage", "Soil conservation",
  "Adapted cultivar","Crop diversification","Crop rotation","Green manure", 
  "Water harvesting", "Managed grazing", "Irrigation","Residue retention",
  "Rotational grazing", "Intensive grazing", "Biological pest control", "Managed grazing",
  "Organic fertilizers","Short duration grazing","Strip grazing"
)

product_terms <- c(
  "Maize", "Zea mays", "Common beans", "Phaseolus vulgaris", "Cattle", "Bos taurus",
  "Bos indicus", "Bull", "Heifer", "Bovine", "Coffee", "Coffea arabica", "Coffea robusta")

region_terms <- c(
  "Belize", "Bolivia", "Brazil", "Colombia", "Costa Rica", "Cuba", "Dominica",
  "Dominican Republic", "Ecuador", "El Salvador", "Grenada", "Guatemala",
  "Guyana", "Haiti", "Honduras", "Jamaica", "Mexico", "Nicaragua", "Paraguay",
  "Peru", "Saint Lucia", "Saint Vincent and the Grenadines", "Suriname",
  "Venezuela", "Argentina", "Bahamas", "Barbados", "Chile", "Panama",
  "Trinidad and Tobago", "Uruguay"
)
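Long term lists are easy to duplicate by accident (for example, "Soil degradation" appears twice in outcomes_terms above). Duplicates do not change the logic of an OR group, but they lengthen the query. The following is a minimal sketch of a clean-up step; dedupe_terms() is a hypothetical helper introduced here for illustration and is not part of openalexR.

```r
# Hypothetical helper: trim whitespace and drop case-insensitive duplicates,
# keeping the first spelling encountered.
dedupe_terms <- function(terms) {
  terms <- trimws(terms)
  terms[!duplicated(tolower(terms))]
}

# Small illustrative vector (not the full term list)
example_terms <- c("Soil fertility", "Soil degradation", "Biodiversity", "soil degradation ")
deduped <- dedupe_terms(example_terms)
print(deduped)
```

The same call could be applied to each of the four term vectors before they are formatted into a query.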

3. Format Terms for OpenAlex Query:

Each set of terms is wrapped in double quotes and joined using " OR " to follow OpenAlex's Boolean search syntax.

What is a Boolean?

Boolean logic is a system of logic that operates on true/false values. In the context of search queries (as in OpenAlex, WoS, or Scopus), it lets us combine multiple search terms using logical operators such as:

Common Boolean Operators:

AND → All conditions must be met
Example: “climate change” AND “crop yield”. This will return results only if they contain both terms.

OR → Any condition can be met
Example: “wheat” OR “maize”. This will return results if either term appears.

NOT → Excludes specific terms
Example: “drought” NOT “irrigation”. This will return results that contain “drought” but exclude “irrigation”.

Using Parentheses () → To group logic properly
Example: (“climate adaptation” OR “drought resistance”) AND “agriculture”. Ensures that at least one of the grouped terms appears along with “agriculture”.

Using Quotes "" → To search for exact phrases
Example: "climate smart agriculture". This ensures the exact phrase is searched for rather than separate words.
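To make the operators concrete, the sketch below applies the same logic to a handful of made-up titles using base R. It only illustrates how AND, OR, NOT, and grouping combine true/false values; it uses plain substring matching and is not how OpenAlex itself evaluates a search.

```r
# Made-up titles used only to illustrate the logic
titles <- c(
  "Climate change and crop yield in maize systems",
  "Drought and irrigation scheduling for wheat",
  "Drought tolerance of common beans",
  "Climate smart agriculture in Honduras"
)

# Case-insensitive, literal (non-regex) phrase match against every title
has <- function(phrase) grepl(tolower(phrase), tolower(titles), fixed = TRUE)

and_hits   <- has("climate change") & has("crop yield")                # AND
or_hits    <- has("wheat") | has("maize")                              # OR
not_hits   <- has("drought") & !has("irrigation")                      # NOT
group_hits <- (has("climate change") | has("drought")) & has("maize")  # grouping

print(titles[not_hits])
```

Each result is a logical vector with one entry per title, so the same pattern extends to any number of terms.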

# Wrap each term in quotes and join with OR
outcomes_bool <- paste0('"', outcomes_terms, '"', collapse = " OR ")
practices_bool <- paste0('"', practices_terms, '"', collapse = " OR ")
region_bool   <- paste0('"', region_terms, '"', collapse = " OR ")
product_bool   <- paste0('"', product_terms, '"', collapse = " OR ")

Construct the Final Search Query:

The formatted terms are combined into a structured filter string that follows OpenAlex’s API format:

# Combine the four groups into one filter string
filter_string <- paste0(
  "filter=title_and_abstract.search:(",
  outcomes_bool, 
  ") AND (",
  practices_bool, 
  ") AND (",
  product_bool, 
  ") AND (",
  region_bool, 
  ")"
)
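When several groups are formatted one by one, it is easy to pass the wrong vector to one of them. As a hedged alternative, the sketch below builds the same kind of filter string from a named list, so each group is formatted by the same code. The helpers or_group() and build_filter() are hypothetical names introduced for illustration, and the small term lists are shortened examples.

```r
# Hypothetical helper: quote each term and join one group with OR
or_group <- function(terms) {
  paste0("(", paste0('"', terms, '"', collapse = " OR "), ")")
}

# Hypothetical helper: join any number of groups with AND under one filter field
build_filter <- function(groups, field = "title_and_abstract.search") {
  group_strings <- vapply(groups, or_group, character(1))
  paste0("filter=", field, ":", paste(group_strings, collapse = " AND "))
}

# Shortened example groups (not the full term lists)
small_groups <- list(
  practices = c("Agroforestry", "Crop rotation"),
  product   = c("Maize", "Coffee"),
  region    = c("Belize", "Peru")
)

small_filter <- build_filter(small_groups)
cat(small_filter, "\n")
```

Adding or removing a group then only requires editing the list, not the string-building code.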

4. Building the Final Search String:

This section of the code builds the request URL for the OpenAlex API. Pagination and retrieval of the matching records are handled in the Querying OpenAlex section.
base_url defines the OpenAlex API endpoint for retrieving academic works.

final_url appends:
- filter_string → Adds the search criteria.
- sort=relevance_score:desc → Sorts results by relevance (descending).
- per-page=200 → Requests 200 results per page (the maximum the API allows; the default is 25).

cat() prints the full request URL.

base_url <- "https://api.openalex.org/works?"
# We add sorting and request 200 results per page (note the hyphen in per-page)
final_url <- paste0(base_url, filter_string, "&sort=relevance_score:desc&per-page=200")

# Do NOT URLencode the entire URL because that will encode the scheme, causing errors.
cat("Final API URL:\n", final_url, "\n\n")
#> Final API URL:
#>  https://api.openalex.org/works?filter=title_and_abstract.search:("Benefit-cost analysis" OR "Break-even period" OR "Economic valuation" OR "Economic impact" OR "Net present value" OR "Payback period" OR "Willingness to pay" OR "Gross margin" OR "Kilograms per hectare" OR "Cost" OR "Climate change" OR "Risk assessment" OR "Vulnerability assessment" OR "Adaptive management" OR "Community awareness" OR "Drought resistance" OR "Indigenous knowledge" OR "Changing climate" OR "Flood resistance" OR "Soil fertility" OR "Soil degradation" OR "Biodiversity" OR "Soil degradation" OR "Yield loss cost" OR "Revenue" OR "Yield" OR "Food security" OR "Water availability" OR "Water stress") AND ("Agroforestry" OR "Evergreen agriculture" OR "Farmer managed natural regeneration" OR "Silvopastoral systems" OR "Alley cropping" OR "Crop rotation" OR "Cover cropping" OR "Integrated pest management" OR "No tillage" OR "Minimum tillage" OR "Soil conservation" OR "Adapted cultivar" OR "Crop diversification" OR "Crop rotation" OR "Green manure" OR "Water harvesting" OR "Managed grazing" OR "Irrigation" OR "Residue retention" OR "Rotational grazing" OR "Intensive grazing" OR "Biological pest control" OR "Managed grazing" OR "Organic fertilizers" OR "Short duration grazing" OR "Strip grazing") AND ("Maize" OR "Zea mays" OR "Common beans" OR "Phaseolus vulgaris" OR "Cattle" OR "Bos taurus" OR "Bos indicus" OR "Bull" OR "Heifer" OR "Bovine" OR "Coffee" OR "Coffea arabica" OR "Coffea robusta") AND ("Belize" OR "Bolivia" OR "Brazil" OR "Colombia" OR "Costa Rica" OR "Cuba" OR "Dominica" OR "Dominican Republic" OR "Ecuador" OR "El Salvador" OR "Grenada" OR "Guatemala" OR "Guyana" OR "Haiti" OR "Honduras" OR "Jamaica" OR "Mexico" OR "Nicaragua" OR "Paraguay" OR "Peru" OR "Saint Lucia" OR "Saint Vincent and the Grenadines" OR "Suriname" OR "Venezuela" OR "Argentina" OR "Bahamas" OR "Barbados" OR "Chile" OR "Panama" OR "Trinidad and Tobago" OR "Uruguay")&sort=relevance_score:desc&per-page=200
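The printed URL contains spaces and double quotes. Whether oa_request() needs these encoded is not established here, but if a client does require percent-encoding, one option is to encode only the filter value and leave the scheme, host, and parameter names untouched. The sketch below shows this with base R's utils::URLencode(); the short filter value is an illustrative example, not the full query.

```r
# A short filter value with spaces and double quotes (illustrative only)
filter_value <- '("Maize" OR "Zea mays") AND ("Belize" OR "Costa Rica")'

# Percent-encode only the value, never the scheme or host
encoded_value <- utils::URLencode(filter_value, reserved = TRUE)

example_url <- paste0(
  "https://api.openalex.org/works?filter=title_and_abstract.search:",
  encoded_value,
  "&per-page=200"
)
cat(example_url, "\n")

# Decoding recovers the original value
decoded_value <- utils::URLdecode(encoded_value)
```

Encoding the value separately keeps the "?", "&", and "=" separators of the URL intact.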

Querying OpenAlex

1. Implementing Cursor-Based Pagination

The OpenAlex API does not return all results at once. Instead, it provides a cursor-based pagination system where each request retrieves a “page” of results and provides a cursor for fetching the next page.

The function fetch_all_hits(start_url) automates retrieving data from the OpenAlex API using cursor-based pagination. It starts with an initial cursor ("*"), encodes it, and appends it to the request URL. The function then repeatedly sends requests (oa_request()), converts responses into a data frame (oa2df()), and stores each page in all_hits. If no results are returned, the loop terminates. It retrieves the next cursor (attr(hits_df,"next_cursor")) to fetch subsequent pages until no new cursor is available. Finally, all pages are combined into a single table using data.table::rbindlist(), ensuring complete data retrieval.

fetch_all_hits <- function(start_url) {
  all_hits <- list()
  cursor <- "*"  # initial cursor value
  
  repeat {
    # URL-encode only the cursor value (so "*" becomes "%2A")
    cursor_encoded <- URLencode(cursor, reserved = TRUE)
    current_url <- paste0(start_url, "&cursor=", cursor_encoded)
    message("Requesting: ", current_url)
    
    # Make the API request
    res <- oa_request(query_url = current_url)
    hits_df <- oa2df(res, entity = "works")
    
    # If no records returned, exit the loop
    if (nrow(hits_df) == 0) break
    
    all_hits[[length(all_hits) + 1]] <- hits_df
    
    # Get the next cursor from the attributes
    new_cursor <- attr(hits_df, "next_cursor")
    if (is.null(new_cursor) || new_cursor == cursor) break
    
    cursor <- new_cursor
    message("Fetched page ", length(all_hits), " with ", nrow(hits_df), " records.")
  }
  
  data.table::rbindlist(all_hits, fill = TRUE)
}
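The cursor loop can be tried out without any network access. The sketch below is an assumption-laden stand-in: fake_pages and get_page() simulate an API that returns one page of results plus the cursor for the next page, and collect_pages() is a hypothetical helper that follows the cursors until none is left. None of these names come from openalexR.

```r
# Simulated API responses, keyed by the cursor that requests them.
# A NULL next_cursor marks the last page.
fake_pages <- list(
  "*"  = list(results = data.frame(id = c("W1", "W2")), next_cursor = "c1"),
  "c1" = list(results = data.frame(id = c("W3", "W4")), next_cursor = "c2"),
  "c2" = list(results = data.frame(id = "W5"),          next_cursor = NULL)
)

# Hypothetical stand-in for a single API call
get_page <- function(cursor) fake_pages[[cursor]]

# Follow cursors until there is no next cursor, a page is empty,
# or the safety cap max_pages is reached
collect_pages <- function(get_page, max_pages = 100) {
  pages <- list()
  cursor <- "*"  # initial cursor value
  while (!is.null(cursor) && length(pages) < max_pages) {
    page <- get_page(cursor)
    if (nrow(page$results) == 0) break
    pages[[length(pages) + 1]] <- page$results
    cursor <- page$next_cursor
  }
  do.call(rbind, pages)
}

all_rows <- collect_pages(get_page)
print(nrow(all_rows))
```

The max_pages cap is a design choice for the sketch: it guards against an endless loop if a service keeps returning a cursor.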

2. Executing the Query and Retrieving Results

This downloads the matching OpenAlex records and stores them in a data table.

oa_results <- fetch_all_hits(final_url)
message("Total records retrieved: ", nrow(oa_results))

3. Saving the Records in a CSV (optional step)

Here, we save the references from the data table to a CSV file. Exploring the column names of the results by calling colnames(oa_results) lets us choose which columns we are interested in. Here, we selected id (unique identifier), display_name (the title of the paper), doi, url, language, relevance_score (indicates how well the paper matches the search query), type (categorizes the publication format), publication_date and ab (the abstract). These fields will be used in the next step of the workflow: tagging with GPT.

if (nrow(oa_results) > 0) {
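  # Keep only the columns needed for the next step (ab holds the abstract)
  oa_results <- oa_results[, .(id, display_name, doi, url, relevance_score, language, type, publication_date, ab)]
  # Create the output folder if it does not exist yet
  dir.create("OA outputs", showWarnings = FALSE)
  fwrite(oa_results, "OA outputs/openalex_results.csv")
  message("Results saved to 'OA outputs/openalex_results.csv'.")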
} else {
  message("No records found.")
}

Testing Results Against WoS

Description of the Matching Process

We conducted a comparison between OpenAlex and Web of Science (WoS) results to determine the overlap between the two databases. The process involved multiple steps, from constructing a search query to matching records based on unique identifiers (DOIs). Below is an outline of our approach:

  1. Translating the Query to WoS Format: The original search query was converted into a format compatible with Web of Science to retrieve relevant research articles. The results were retrieved and downloaded from the Web of Science web interface.

  2. Extracting DOIs: From both datasets, the DOI (Digital Object Identifier) was extracted as a unique identifier for each record. This allowed us to perform exact matches between the two databases.

  3. Standardizing DOIs: To ensure consistency, DOIs were converted to lowercase, and unnecessary prefixes (e.g., https://doi.org/) were removed before performing the matching.

  4. Matching Process: We determined how many DOIs from OpenAlex were also present in WoS by checking for exact matches. We categorized each record as OpenAlex only, WoS only, or both.

  5. Visualization of Results: A summary table and a histogram were generated to illustrate the overlap and distribution of relevance scores for matched and unmatched papers.
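The standardizing and matching steps can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the exact code used for the comparison: normalize_doi() and drop_missing() are hypothetical helpers, the DOIs are made up, and records without a DOI are simply dropped because they cannot be matched this way.

```r
# Hypothetical helper: lowercase, trim, and strip common DOI prefixes
normalize_doi <- function(doi) {
  doi <- tolower(trimws(doi))
  doi <- sub("^https?://(dx\\.)?doi\\.org/", "", doi)
  doi <- sub("^doi:\\s*", "", doi)
  doi[is.na(doi) | doi == ""] <- NA_character_
  doi
}

# Hypothetical helper: drop missing values and repeated DOIs
drop_missing <- function(x) unique(x[!is.na(x)])

# Made-up DOIs standing in for the two exports
oa_doi  <- c("https://doi.org/10.1000/ABC123", "https://doi.org/10.1000/xyz789", NA)
wos_doi <- c("10.1000/abc123", "10.1000/QRS456", "")

oa_clean  <- drop_missing(normalize_doi(oa_doi))
wos_clean <- drop_missing(normalize_doi(wos_doi))

# Exact matching on the cleaned DOIs
overlap <- data.frame(
  Category = c("OpenAlex Only", "WoS Only", "Both"),
  Count = c(
    length(setdiff(oa_clean, wos_clean)),
    length(setdiff(wos_clean, oa_clean)),
    length(intersect(oa_clean, wos_clean))
  )
)
print(overlap)
```

Because matching is exact, any difference left after cleaning (for example a trailing character) counts as a non-match, which is why the standardizing step comes first.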

The table below presents the number of records found exclusively in each database and those that overlapped. Of the matched DOIs, 12,427 were found only in OpenAlex and 3,454 only in WoS, while 1,762 were found in both. The overlap therefore covers about 12% of the OpenAlex set (1,762 of 14,189) and about 34% of the WoS set (1,762 of 5,216).

#>        Category Count
#> 1 OpenAlex Only 12427
#> 2      WoS Only  3454
#> 3          Both  1762

The histogram above represents the distribution of relevance scores for research papers found in OpenAlex, with an overlay of papers that were also found in Web of Science (WoS).

Blue bars represent papers that are only in OpenAlex, while green bars represent papers that appear in both OpenAlex and WoS.

The relevance score, calculated by OpenAlex, ranks documents based on how closely they match the given search terms, with higher scores indicating greater relevance. According to the OpenAlex documentation, the score is based on the text similarity between the search terms and the searched fields (here, title and abstract) and also includes a weighting for citation counts, so that more highly cited works score higher, all else being equal. The score is therefore a ranking aid for a given query rather than an absolute measure of a paper's importance.

Key Observations from the Plot

Most Papers Have Low Relevance Scores:

The majority of papers within OpenAlex results have relevance scores concentrated below 50, with a steep decline as the scores increase. A very high number of papers are clustered near low relevance scores (0–10).

Matched Papers (WoS & OpenAlex) Also Follow This Distribution:

Papers found in both OpenAlex and WoS (green) are mostly concentrated in the low relevance range, following the same pattern as OpenAlex-only papers. This suggests that WoS matches are not necessarily the highest-ranked papers in OpenAlex but are distributed across different relevance levels.

Sparse High-Relevance Papers:

Very few papers have a relevance score above 50. This suggests that only a small fraction of articles are extremely relevant to the exact search terms used.
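Observations like these can be checked numerically by binning the relevance scores and comparing the two groups. The sketch below uses made-up scores and a made-up in_wos flag; the bin edges (10 and 50) follow the ranges discussed above, and the column names are assumptions for illustration.

```r
# Made-up relevance scores and a flag for whether the DOI was also in WoS
scores <- data.frame(
  relevance_score = c(2, 5, 8, 12, 30, 47, 55, 90),
  in_wos = c(FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)
)

# Bin the scores: [0, 10], (10, 50], (50, Inf)
scores$bin <- cut(
  scores$relevance_score,
  breaks = c(0, 10, 50, Inf),
  labels = c("0-10", "10-50", ">50"),
  include.lowest = TRUE
)

# Count papers per bin and group
scores$group <- ifelse(scores$in_wos, "Both", "OpenAlex only")
counts <- table(scores$bin, scores$group)
print(counts)

# Share of each group with a relevance score of 50 or below
share_at_or_below_50 <- tapply(scores$relevance_score <= 50, scores$in_wos, mean)
print(share_at_or_below_50)
```

Comparing shares rather than raw counts keeps the two groups comparable even though they differ greatly in size.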

Benefits and Limitations of the OpenAlex API

✅ Benefits of OpenAlex API

  1. Free and Open Access – OpenAlex provides a freely accessible, open-source alternative to bibliographic databases like Web of Science and Scopus, allowing researchers to use it without paywalls.

  2. Comprehensive Coverage – It aggregates data from multiple sources, including journals, preprints, and institutional repositories, making it a rich resource for bibliometric analysis.

  3. Relevance Scoring – OpenAlex provides a relevance score to help users identify the most relevant publications for a given search query.

  4. Dynamic and Continuously Updated – OpenAlex updates its records on an ongoing basis, so newly published research outputs are added regularly.

⚠️ Limitations of OpenAlex API

  1. Strict Character Limit on Search Queries – One of OpenAlex’s biggest drawbacks is its character limit on search strings, restricting complex queries with many Boolean operators. This limitation makes it difficult to conduct highly specific, detailed searches similar to Web of Science or Scopus.

  2. Limited Peer-Reviewed Filtering – Unlike WoS or Scopus, OpenAlex does not have a built-in peer-reviewed filter, meaning some search results may include preprints and other non-peer-reviewed content.

  3. Fewer Advanced Search Features – While OpenAlex supports Boolean searches, its query syntax is not as advanced or precise as other databases, potentially leading to broader or less refined results.

  4. Potentially Inconsistent Metadata – Since OpenAlex pulls data from various sources, there may be discrepancies in metadata (DOI formats, affiliations, journal names, etc.), requiring additional cleaning for high-quality bibliometric analysis. This may cause duplication of references and requires a thorough de-duplication process.

  5. Variable Data Quality – The completeness and accuracy of metadata depend on OpenAlex’s sources, which might result in missing abstracts, incorrect affiliations, or outdated records.
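One possible way to work within a query length limit is to split a long OR group into several shorter queries and combine the results afterwards. The sketch below is an illustration only: chunk_terms() is a hypothetical helper, and max_chars is an assumed budget chosen for the example, not a documented OpenAlex value.

```r
# Hypothetical helper: split terms into consecutive chunks whose quoted,
# OR-joined form stays within max_chars (an assumed budget).
chunk_terms <- function(terms, max_chars) {
  chunks <- list()
  current <- character(0)
  for (term in terms) {
    candidate <- c(current, term)
    joined <- paste0('"', candidate, '"', collapse = " OR ")
    if (nchar(joined) > max_chars && length(current) > 0) {
      # Adding this term would exceed the budget: close the current chunk
      chunks[[length(chunks) + 1]] <- current
      current <- term
    } else {
      current <- candidate
    }
  }
  if (length(current) > 0) chunks[[length(chunks) + 1]] <- current
  chunks
}

regions <- c("Belize", "Bolivia", "Brazil", "Colombia", "Costa Rica", "Cuba")
chunks <- chunk_terms(regions, max_chars = 35)
print(length(chunks))
```

Since the terms inside a group are joined with OR, running one query per chunk and merging the results gives the union of the matches; the merged records would still need de-duplication, for example by DOI.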

References

Priem, J., Piwowar, H. and Orr, R., 2022. OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts. arXiv preprint arXiv:2205.01833.

Scheidsteger, T. and Haunschild, R., 2022. Comparison of metadata with relevance for bibliometrics between Microsoft Academic Graph and OpenAlex until 2020. arXiv preprint arXiv:2206.14168.

Culbert, J., Hobert, A., Jahn, N., Haupka, N., Schmidt, M., Donner, P. and Mayr, P., 2024. Reference coverage analysis of OpenAlex compared to Web of Science and Scopus. arXiv preprint arXiv:2401.16359.