Abstract
Taxonomies and ontologies of research topics (e.g., MeSH, UMLS, CSO, NLM)
provide the primary framework through which intelligent systems explore and
interpret the literature. However, these
resources have traditionally been manually curated, a process that is
time-consuming, prone to obsolescence, and limited in granularity. This paper
presents Sci-OG, a semi-automated methodology for generating research topic
ontologies, employing a three-step approach: 1) Topic Discovery, extracting
potential topics from research papers; 2) Relationship Classification,
determining semantic relationships between topic pairs; and 3) Ontology
Construction, refining and organizing topics into a structured ontology. The
relationship classification component, which constitutes the core of the
system, integrates an encoder-based language model with features describing
topic occurrence in the scientific literature. We evaluate this approach
against a range of alternative solutions using a dataset of 21,649 manually
annotated semantic triples. Our method achieves the highest F1 score (0.951),
surpassing various competing approaches, including a fine-tuned SciBERT model
and several LLM baselines, such as the fine-tuned GPT4-mini. A use case
further illustrates the practical application of our system by extending the
CSO ontology in the area of cybersecurity. The presented
solution is designed to improve the accessibility, organization, and analysis
of scientific knowledge, thereby supporting advancements in AI-enabled
literature management and research exploration.