Written by: Bilal Ben Mahria

Editorial Contributions: Benjamin Paul Rode, Garima Gujral, Stratos Kontopoulos

Table of Contents

  • Abstract

  • Introduction

  • What LLMs Bring: The Gift of Speed

  • What LLMs Lack: LLMs as Generative Ontologists

    • Technical Limitations

      • Token Constraints and Output Completeness

      • Domain Adaptation Challenges

      • Semantic and Axiom-Level Errors

    • Quality and Consistency Problems

    • Reliability Issues

      • Hallucinations and Factual Inaccuracies

      • Model Instability and Version Sensitivity

  • The Logic Gap: Why Probability is Not Logical Reasoning

  • The Human Safety Net

  • Conclusion

Abstract

Recently, a growing number of startups and technology companies have begun promoting a compelling—but dangerously simplified—narrative: that Large Language Models (LLMs) can automatically generate ontologies, and therefore reduce or even eliminate the need for human ontologists. This claim is appealing from a cost and speed perspective, yet it overlooks a fundamental distinction between generating structured text and engineering formal, semantically consistent knowledge systems. My motivation for this work arises precisely from this tension.

While LLMs demonstrate impressive capabilities in producing ontology-like artifacts, their outputs often lack the explicit semantics, context, and logical guarantees required for robust knowledge representation. As a result, organizations risk replacing deep expertise with probabilistic approximations, potentially introducing ambiguity, inconsistency, and hidden technical debt into their systems. This blog aims to critically examine the idea of “LLMs as generative ontologists,” not to dismiss their value, but to clarify their role, expose their limitations, and advocate for a more balanced, hybrid approach where human ontologists remain essential architects rather than obsolete intermediaries.

Designing a map of knowledge

Imagine we are trying to design a vast and detailed map, not of lands, but of knowledge itself. A map that brings order to complex fields such as medicine, biology, or cultural heritage, where each concept has a place and every connection carries meaning. In the world of data, we call this map an ontology: a way of giving every idea a home and every connection a meaning. Whether we are organizing medicine or biology, an ontology is a reflection of how we see reality. More specifically, an ontology can answer the following questions:

  • What exists? (e.g., Are there physical objects, minds, numbers, or only matter?).

  • What is the nature of a specific entity? (e.g., What is a person? What is time?).

  • How are entities structured or related? (e.g., Is reality made of independent things or interconnected processes?).

  • Does a certain type of entity exist? (e.g., Does God exist? Do universal truths exist?).

However, knowing what exists is only half the story. To truly live in the world, we also need to know how to act. This brings us to Deontology, or Deontological Ethics, a tradition most closely associated with the philosopher Immanuel Kant, in which doing the right thing is determined not by outcomes but by universal moral laws that govern allowable actions, such as “Don’t lie” or “Don’t kill”. In this context, while ontology is about "what is," deontology is about "what we should do": a system of rules that tells us how to behave regardless of the outcome. In our digital world, if ontology is the structure of reality, deontology is the set of rules (SHACL shapes, OWL axioms) that makes sure the data follows the laws of logic and ethics.
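The analogy between deontological rules and data constraints can be made concrete. Below is a minimal, hypothetical sketch in plain Python of what a SHACL-style shape does conceptually: it checks instances against declared rules rather than deriving new facts. The `person_shape` rules and the sample records are invented for illustration, not taken from any real ontology.

```python
# A SHACL-style "shape" expressed as plain Python rules (illustrative only).
# Each rule states what a conforming record MUST look like -- the
# deontological "laws" the data has to obey.
person_shape = {
    "name": lambda v: isinstance(v, str) and len(v) > 0,
    "age": lambda v: isinstance(v, int) and v >= 0,
}

def validate(record, shape):
    """Return the list of violated rules (empty list = record conforms)."""
    violations = []
    for prop, rule in shape.items():
        if prop not in record or not rule(record[prop]):
            violations.append(prop)
    return violations

good = {"name": "Ada", "age": 36}
bad = {"name": "", "age": -1}

print(validate(good, person_shape))  # []
print(validate(bad, person_shape))   # ['name', 'age']
```

Real SHACL shapes are declared in RDF and validated by a SHACL engine, but the division of labor is the same: the rules say what is permitted, and a separate checker enforces them.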

Today, we are facing a big change. With the rise of Artificial Intelligence (especially LLMs), we are asking a new question: can these LLMs help us build both the knowledge and the set of rules for action (Ontology + Deontology)? Because LLMs can read millions of pages in a heartbeat, they are like a librarian who never sleeps. They can find hidden patterns and organize information at a speed no human could ever match. They offer us a way to structure the world’s knowledge on a scale we’ve never seen before. Yet this possibility comes with a deeper reflection. While LLMs are powerful at recognizing patterns, ontologies require more than that: they require clarity, consistency, and meaning that goes beyond words. This raises an important question:

Can a system that learns from patterns truly understand the structure of the world and its constraints, or is it only imitating it? This question sits at the center of the debate around modern AI. Most current models do not reason from universal causal principles; instead, they operate within bounded contexts, drawing on statistical regularities found in data. As a result, they can reproduce the appearance of logic and understanding without necessarily being connected to the real mechanisms that govern the world. Their outputs may sound coherent, structured, and even insightful, but that coherence often reflects learned correlations rather than genuine comprehension. 

Using LLMs as generative ontologists may accelerate the process and open new possibilities. However, it also challenges us to rethink the role of human expertise in shaping knowledge. Rather than replacing the human mind, LLMs may become tools that assist, suggest, and inspire—while the responsibility of true understanding remains in human hands.

What LLMs Bring: The Gift of Speed

The primary advantage of using LLMs in the field of ontology engineering is their ability to provide generative flexibility. They act as “support tools” that can significantly reduce the time and high-level expertise traditionally required to start an ontology project from scratch.

  • Rapid Prototyping: LLMs are excellent at "draft generation," allowing researchers and engineers to see a rough version of an ontology in seconds.

  • Accessibility: They lower the barrier to entry, helping those who may not be experts in complex coding languages (like OWL) to begin structuring their data.

  • Initial Brainstorming: LLM models can suggest initial categories and relationships that a human might have overlooked, serving as a powerful "brainstorming" partner.

At first glance, using an LLM seems like a dream come true because they can read and write faster than any human. However, when we actually put them to work, we find that they are more like enthusiastic beginners than expert architects. While they can create a rough draft quickly, they often struggle with the deep work. For example, in the complex world of life sciences, a professionally curated ontology might have nearly 9,000 categories, but an LLM-generated one might only manage about 176 before it exhausts its output-token budget.

What LLMs Lack: LLMs as Generative Ontologists

Despite their speed, LLMs suffer from several drawbacks that prevent them from operating independently as generative ontologists.

Technical Limitations

Token Constraints and Output Completeness

LLMs have strict limits on how much text they can generate at once, known as token constraints. These limits lead to incomplete outputs and shallow hierarchical structures in generated ontologies. Existing LLMs struggle to generate ontologies with multiple hierarchical levels, rich interconnections, and comprehensive class coverage, due both to the number of tokens they can generate and to inadequate domain adaptation. In complex domains such as finance and the life sciences, the domain knowledge simply exceeds what LLMs can emit within their token limits.
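A back-of-the-envelope calculation shows why these limits bite. The figures below (tokens consumed per class definition, a single-response output budget) are rough assumptions chosen for illustration, not measured benchmarks:

```python
# Rough estimate of how many ontology classes fit in one LLM response.
# Both numbers are illustrative assumptions, not benchmarks.
TOKENS_PER_CLASS = 40   # label, parent class, definition, a few axioms in Turtle
OUTPUT_BUDGET = 8_192   # a common single-response output-token limit

max_classes = OUTPUT_BUDGET // TOKENS_PER_CLASS
print(max_classes)  # 204

# A mature life-science ontology can hold thousands of classes:
needed = 9_000
print(f"coverage in one shot: {max_classes / needed:.1%}")  # 2.3%
```

Even with generous assumptions, a single generation covers only a small fraction of a large domain, which is why multi-pass pipelines with human curation become necessary.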

A concrete example from my experience illustrates this limitation clearly. During my time at Upwork, I explored the use of Llama 3.1-8b to automatically generate a skill taxonomy, where the internal gold standard already consisted of more than 2,000 well-curated skills. Despite multiple iterations and prompt refinements, the model consistently failed to approximate the breadth and structure of the existing taxonomy. Its outputs were systematically incomplete, often covering only a limited subset of skills, while missing critical concepts and relationships required for a production-grade ontology. This gap was not merely quantitative but also structural—what was generated resembled a plausible list rather than a comprehensive, normalized, and semantically grounded taxonomy.

This example highlights a key issue: while LLMs can generate fragments of an ontology, they struggle to achieve the coverage, consistency, and completeness that are essential for real-world knowledge systems.

Domain Adaptation Challenges

Inadequate domain adaptation represents another critical technical limitation. Because LLMs are trained largely on general web text, they struggle with specialized domains that require specific terminology and document-construction rules. This insufficient domain adaptation results in generic ontologies that lack the depth and complexity required for advanced reasoning tasks: the generated ontologies often contain overly generic terms that dilute the ontology’s focus on domain-specific concepts.

A clear example can be observed in the domain of skills and talent modeling. When attempting to generate a skill ontology, where concepts must be precisely defined, normalized, and organized into meaningful hierarchies, LLMs often produce overly generic and poorly structured outputs. For instance, instead of distinguishing between closely related but semantically different skills such as Python, Python for Data Science, and Python for Backend Development, the model may collapse them into a single broad category like Programming or Software Development. In other cases, it may introduce vague classes such as Technical Skill or Digital Skill, which add little value and fail to capture the granularity required for tasks like talent matching or skill-based ranking.
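The granularity problem can be pictured as two competing hierarchies. The skills and the structure below are hypothetical, chosen only to contrast a normalized taxonomy with the collapsed output an LLM tends to produce:

```python
# A curated hierarchy keeps semantically distinct skills separate...
curated = {
    "Programming": {
        "Python": {
            "Python for Data Science": {},
            "Python for Backend Development": {},
        },
    },
}

# ...whereas a collapsed, LLM-style output flattens them into one bucket.
collapsed = {"Programming": {}}

def count_leaves(tree):
    """Count the most specific (leaf) concepts in a hierarchy."""
    if not tree:
        return 1
    return sum(count_leaves(child) for child in tree.values())

print(count_leaves(curated))    # 2 distinct, matchable skills
print(count_leaves(collapsed))  # 1 -- too coarse for skill-based ranking
```

For talent matching, only the leaf concepts are actionable; collapsing them destroys exactly the distinctions the application needs.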

Semantic and Axiom-Level Errors

An axiom is a truth we do not prove, because it is the very ground upon which proof becomes possible. It is not questioned, not because it is beyond doubt, but because it stands at the beginning of thought itself.  In the world of ontology, axioms are more than technical rules or constraints. They are the quiet principles that give shape to meaning. They act like the ethics and values of a domain. They define the Deontological part of the ontology. Without axioms, knowledge would be scattered; with them, it becomes a structured and intelligible world.

Large Language Models often follow patterns, but these patterns can lead them into subtle mistakes. They may confuse meaning with appearance, and in doing so create small but important gaps between definitions and the true structure of an ontology. For example, they might remove essential rules, or swap logical operators, treating a set of alternatives as if every condition must hold at once. There is also a quieter limitation: these models are not precise with structure. They can struggle to count correctly or keep track of all the elements they create. In ontology design, where every class and every rule matters, even a small miscount can change the whole system. This reveals something deeper: LLMs are good at shaping language, but less reliable when exact structure and logic are required. They can imitate understanding, but they do not always preserve the truth behind it.
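Swapping a logical operator is not cosmetic. Using plain Python sets as a stand-in for OWL class expressions (the skill names and individuals are invented), an intersection and a union admit very different sets of members:

```python
# Stand-in "class expressions" as Python sets of individuals.
knows_python = {"ana", "bob", "carol"}
knows_sql = {"bob", "dave"}

# Intended axiom: a DataEngineer must know Python AND SQL (intersection).
intended = knows_python & knows_sql
# An accidental rewrite: Python OR SQL (union) -- far less restrictive.
accidental = knows_python | knows_sql

print(sorted(intended))    # ['bob']
print(sorted(accidental))  # ['ana', 'bob', 'carol', 'dave']
# The union silently admits three individuals the axiom should reject.
```

A single swapped operator quietly changes which individuals a class admits, and nothing in the surface syntax of the generated ontology warns the reader.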

Quality and Consistency Problems

A formal ontology requires absolute "cleanliness" to be functional for a reasoning engine. Unfortunately, generative models often produce semantic noise: schema violations that undermine the logic of the entire graph. Professional audits have identified three recurring structural flaws:

  • Multiple Domains/Ranges: Assigning conflicting specifications to a property, which results in unsatisfiable classes if the domains/ranges are disjoint.

  • Redundant/Overlapping Classes: Creating duplicate definitions for the same concept, leading to "ontological overload" and data fragmentation.

  • Inconsistent Naming: Violating uniform naming conventions, which prevents automated tools from effectively mapping or merging knowledge silos.
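Flaws like these are mechanically detectable. Below is a hedged sketch of a trivial audit pass over a made-up list of (property, domain) declarations, of the kind an LLM-generated ontology might contain, flagging any property that was assigned more than one domain:

```python
from collections import defaultdict

# Hypothetical property/domain declarations as they might come out of an
# LLM-generated ontology (triples simplified to pairs for illustration).
declarations = [
    ("hasAuthor", "Book"),
    ("hasAuthor", "Song"),   # conflicting second domain
    ("hasTitle", "Book"),
]

def conflicting_domains(decls):
    """Return properties declared with more than one domain."""
    domains = defaultdict(set)
    for prop, dom in decls:
        domains[prop].add(dom)
    return {p: sorted(ds) for p, ds in domains.items() if len(ds) > 1}

print(conflicting_domains(declarations))  # {'hasAuthor': ['Book', 'Song']}
```

In practice such checks run as part of an ontology audit pipeline, and a reasoner would additionally report the unsatisfiable classes that conflicting disjoint domains produce.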

At a deeper level, these challenges reflect the nature of LLMs themselves. They do not operate within a fixed structure of meaning, but rather generate fluid interpretations shaped by patterns in language. Unlike formal ontologies, which impose discipline, clarity, and logical coherence, LLMs produce a more shifting and probabilistic picture of the world: One that resembles discourse more than it does a well-defined system of knowledge.

Reliability Issues

Hallucinations and Factual Inaccuracies

The tendency toward hallucinations—generation of false or unsupported statements—represents a critical reliability concern for LLM-based ontology generation. This problem manifests as structurally correct but factually inaccurate texts, which is particularly problematic for ontologies that must faithfully represent domain knowledge. LLMs struggle to explain their generation results, making it difficult to verify whether generated ontological content accurately reflects intended domain concepts. The risk of generating incorrect data increases substantially when LLMs are used without appropriate control mechanisms.

Model Instability and Version Sensitivity

Model instability across versions introduces unpredictability into LLM-based ontology engineering workflows. This suggests that ontology construction approaches relying on specific LLM behaviors may become invalid following model updates. The evolving nature of AI capabilities introduces complexities for sustained ontology engineering efforts, as methods that work with one model version may produce different results with subsequent versions.

The Logic Gap: Why Probability is Not Logical Reasoning

The fundamental divide in AI strategy lies between probabilistic text generation and deterministic logic. A dangerous misconception in the industry is that LLMs can run "OWL (Web Ontology Language) reasoning" natively. They cannot. Deterministic reasoning requires a dedicated OWL engine; LLMs merely predict the next likely token based on a training distribution. This lack of logical rigor leads to critical axiom-level errors. One of the most severe is the misuse of allValuesFrom constructs. LLMs frequently alter logical operators, such as substituting a union (∪) for an intersection (∩). By making this swap, the model effectively opens the door to invalid data, as a union is far less restrictive than an intersection. This undermines the validation utility of the entire knowledge graph. In reality, we face a "Strings vs. Things" paradox:

  • Companions of Strings, Strangers to Meaning: Counter-intuitively, LLMs are more accurate when prompted with natural language (strings) rather than formal semantic syntax (RDF). Using formal "Things" actually degrades the model’s extraction performance.

  • The Novice Gap: Evaluated against experts, LLM output is comparable only to human novice modellers.

To perform true "Deterministic Reasoning", such as calculating exactly how many "Cloud Engineers" exist based on overlapping, complex skill sets, we cannot rely on an LLM. We need a dedicated OWL reasoning engine. LLMs are pattern matchers, not logic engines.
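The "Cloud Engineer" count above is exactly the kind of question a deterministic engine answers by set algebra rather than token prediction. The profiles and skill sets below are fabricated for illustration; a real OWL reasoner would derive class membership from axioms, but the underlying set logic is the same:

```python
# Fabricated skill profiles for illustration only.
profiles = {
    "ana": {"AWS", "Terraform", "Python"},
    "bob": {"AWS", "Python"},
    "carol": {"GCP", "Terraform"},
    "dave": {"AWS", "Terraform"},
}

# Definition (axiom): CloudEngineer = knows a cloud platform AND an IaC tool.
CLOUD = {"AWS", "GCP", "Azure"}
IAC = {"Terraform", "Pulumi"}

cloud_engineers = sorted(
    name for name, skills in profiles.items()
    if skills & CLOUD and skills & IAC
)
print(cloud_engineers)       # ['ana', 'carol', 'dave']
print(len(cloud_engineers))  # exactly 3 -- the same answer on every run
```

The answer is reproducible and auditable: change an axiom and the membership changes in a traceable way, something no amount of next-token prediction can guarantee.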

The Human Safety Net

So, does this mean we shouldn't use generative AI for this work? Not at all. LLMs are excellent support tools as long as there is a human-in-the-loop. Think of the AI as a novice assistant who can do the boring, heavy lifting of creating a first draft. Then, a human expert steps in to fix the hallucinations, delete the redundant entities, and add the deep layers that the AI missed. By combining the speed of the AI with the wisdom of a human, we can build better ontologies than either could create alone.

Some claim that the era of the human ontologist is over, believing that Large Language Models can architect the knowledge on their own. This is a profound misunderstanding. In fact, the true power of AI is unlocked only through a "human-in-the-loop" strategy. We are moving toward a multi-level future where AI agents act as specialized workers: some gathering data, others verifying facts, while a human architect oversees the structure and validates the proposed outcomes. Furthermore, by integrating Retrieval-Augmented Generation (RAG) and Graph-based Retrieval Augmented Generation (Graph-RAG), we anchor these systems to a bedrock of verified truth, ensuring they don't just "guess," but actually reflect reality.

Conclusion

Currently, LLM-generated ontologies are comparable to the work of human novice modelers. They are wonderful for creating a first draft or a quick sketch of a field, but they lack the reasoning skills and deep expertise to build the final product (See Ontology as Product).

The future of "Generative Ontology" lies in a partnership: the AI provides the speed and the draft, while the human expert provides the logic, the depth, and the final seal of approval.

How Can You Contribute To The Newsletter?

To contribute to the newsletter, please fill out the following Google form:

Your responses will help shape future editions, guide the topics we investigate, and inform the kinds of conversations we facilitate. 

And if you would like to become more deeply involved in a community of practitioners navigating the responsible adoption of AI technologies, feel free to visit swarmcommunity.org to access resources and/or book a call to join the SWARM community.
