Characterization and automated classification of sentences in the biomedical literature: a case study for biocuration of gene expression and protein kinase activity

Daniela Raciti, Kimberly M Van Auken, Valerio Arnaboldi, Christopher J Tabone, Hans-Michael Muller, Paul W Sternberg

September 2025

Abstract

Biological knowledgebases are essential resources for biomedical researchers, providing ready access to gene function and genomic data. Professional, manual curation of knowledgebases, however, is labour-intensive, and thus high-performing machine learning (ML) methods that improve biocuration efficiency are needed. Here, we report on sentence-level classification to identify biocuration-relevant sentences in the full text of published references for two gene function data types: gene expression and protein kinase activity. We performed a detailed characterization of sentences from references in the WormBase bibliography and used this characterization to define three tasks for classifying sentences as either (i) fully curatable, (ii) fully and partially curatable, or (iii) all language-related. We evaluated various ML models applied to these tasks and found that GPT and BioBERT achieve the highest average performance, with F1 scores ranging from 0.89 to 0.99 depending on the task. Moreover, our inter-annotator agreement analyses and curator timing exercises demonstrated that curators readily converged on classification of high-quality training sentences that take a relatively short period of time to collect, making expansion of this approach to other data types a realistic addition to existing biocuration workflows. Our findings demonstrate the feasibility of extracting biocuration-relevant sentences from full text. Integrating these models into professional biocuration workflows, such as those used by the Alliance of Genome Resources and the ACKnowledge community curation platform, might well facilitate efficient and accurate annotation of the biomedical literature.

Type

Journal article

Publication

Database