There are many natural language technologies, some of which I touched on last time in Introduction to Unstructured Data. This time I’ll talk about the tradeoffs to consider when turning unstructured data into knowledge that can be understood by computers. The key idea I’ll delve into is how knowledge that comes from unstructured data should be represented in a computer in the first place.
Knowledge representation is the area of computer science concerned with assigning meaning (i.e., a collection of meaningful and potentially actionable facts or beliefs) to symbols, or a language, often to facilitate inferencing from those knowledge elements. Knowledge and representation are both very broad terms, so one might expect there to be many possible approaches to the task, and one would be right. Over the decades, differing approaches have given rise to dichotomies of sorts.
The goal of this article is to take a step back and, with our general business use cases from last time in mind, summarize the primary dichotomies in order to suggest a way forward to best use knowledge from unstructured data.
Familiarity with knowledge representation methods and tradeoffs is key to using natural language processing technologies correctly. The way that knowledge is structured can determine its downstream levels of utility, accessibility, and reuse. It's an important consideration for integrating information from many sources, from many technologies, across different times, and being able to automatically infer new knowledge from the pieces you get.
Note: This is not a comprehensive survey of knowledge representation techniques. As an academic field, knowledge representation is both broad and deep. For that kind of survey, see Knowledge Representation and Reasoning.
APPROACHES, DECISIONS, TRADEOFFS
Over the years there have been a variety of methods for representing knowledge originally expressed in natural language. Decisions have been made in defining these approaches, each with a wide variety of tradeoffs and downstream implications. Three of the more fundamental of these decisions are:
- relative vs. explicit
- word vs. concept
- procedural vs. declarative
We’ll cover each of these in turn.
RELATIVE MEANING VS. EXPLICIT MEANING
What, in the world of information science, is unstructured data (text) most similar to, most blendable with, and therefore most readily understood with? Other unstructured data.
One decision is whether different parts of unstructured data should just be understood relative to other parts, i.e. finding statistical relationships among fragments of the document set, or corpus, or should require some other, external grounding context, such as a concept structure for explicit meaning.
Some methods assume the former. In many of the common ways that unstructured data are handled these days, such as in search-, clustering-, and classification-based systems, a statistical relationship is calculated between and among documents. That is their primary means of representing a kind of knowledge, rough signatures for rough concepts. Exploration of the corpus is either done through searches, where each query is treated as a little document of its own and the statistical space is queried for similar data, or through faceted navigation where clusters of documents are represented by differentiating phrases.
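The relative-meaning approach can be sketched in a few lines: documents are reduced to statistical signatures (here TF-IDF vectors), and a query is treated as a little document of its own, ranked against the corpus by similarity. This is a minimal toy illustration, not any particular vendor's implementation; the corpus and document names are hypothetical.

```python
# Toy sketch of relative meaning: documents are known only by their
# statistical relationships to one another (hypothetical mini-corpus).
import math
from collections import Counter

corpus = {
    "doc1": "chocolate chips in cookie dough",
    "doc2": "chocolate cake with chocolate frosting",
    "doc3": "silicon chips on a circuit board",
}

def tf_idf_vectors(docs):
    """Build a TF-IDF vector (term -> weight) for each document."""
    tokenized = {name: text.split() for name, text in docs.items()}
    n = len(docs)
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    return {
        name: {t: (c / len(toks)) * math.log(n / df[t])
               for t, c in Counter(toks).items()}
        for name, toks in tokenized.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Treat the query as a little document and rank the corpus against it."""
    vecs = tf_idf_vectors({**docs, "_q": query})
    q = vecs.pop("_q")
    return sorted(docs, key=lambda name: cosine(q, vecs[name]), reverse=True)

print(rank("chocolate cookie", corpus))  # doc1 ranks first, doc3 last
```

Note what the representation captures: rough topical closeness between bags of words, nothing more. There is no statement anywhere in the system of what 'chocolate' actually is.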
A drawback of this approach is that it's usually only good for a single hop from one query phrase or document to another phrase or document: it's lossy in that there are many hits of varying, middling strength to the relevant chunks of knowledge, each of which has a very different meaning. Only the rough topic is preserved, and a human then needs to select from among the choices, the documents.
Some folks take this approach some steps further, using a finer 'knowledge' granularity, with the aim of building a web of relationships between small document fragments. They treat each fragment as a context, or a knowledge nugget, again related to others statistically, and they try to find paths of multiple hops to connect one concept to another. The drawback is that the number of paths grows exponentially, the quality of the links becomes tenuous, and humans or some very specific human-defined procedures need to determine relevance, limiting automation.
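To make the path-explosion concrete: if each knowledge nugget is statistically linked to some branching factor of other nuggets, the number of candidate multi-hop paths grows as a power of the hop count. A back-of-the-envelope sketch (the branching factor of 20 is an illustrative assumption):

```python
# Rough illustration of multi-hop path explosion: with an average
# branching factor b, there are on the order of b**k candidate k-hop paths.
def num_paths(branching, hops):
    return branching ** hops

for hops in (1, 2, 4, 8):
    # At b=20, eight hops already yields ~25.6 billion candidate paths.
    print(hops, num_paths(20, hops))
```

Even at modest branching factors, relevance judgments over that many paths are out of reach for humans and brittle for hand-coded procedures alike.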
Many companies have taken this approach of intracorpus relationships because they started out as search companies. They try to improve their notion of a concept by augmenting keyword-type relationships with statistical expansions of words to those that often appear together. A key hypothesis here is that concepts that appear together often enough are very related in meaning, or even that set or class membership can be learned from terms appearing near each other. It makes sense that broad, diffuse topics and their relationships can be derived that way, because the granularity of the context in the content is on par with that; but even closely related terms like 'red' and 'orange' appear near each other very infrequently. Anything closer to knowledge and meaning will require another approach.
Viewed this way, relative-meaning technologies are fundamentally search technologies, not knowledge representation technologies. Fixed points of reference of some sort are therefore required for grounding in more definite, concrete, and reusable knowledge.
So what form can these reference points—which are required for knowledge representation—take? Are concepts of some kind necessary, and if so, what does 'concept' mean in this context? Or perhaps words (e.g., in a large hyperlinked reference dictionary of some kind) are all that are needed?
WORD VS. CONCEPT
Unstructured data consists of words. It would seem elegant and intuitive to have words stand for their own meanings.
Many proponents of the decision to go with words for meaning use something like WordNet—essentially a dictionary and thesaurus—as a general knowledgebase of what meanings and roles each word can have and how they are related.
For example, by referencing something like WordNet the phrase 'chocolate chips' can be recognized as a phrase that's also a proto-concept, as 'chips', little pieces, that are 'chocolate'.
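The word-as-meaning idea can be sketched with a toy lexicon in the spirit of WordNet (the real WordNet is vastly larger and richer; the entries and helper names here are hypothetical):

```python
# Toy word-based lookup: a hypothetical mini-dictionary with
# WordNet-style hypernym ("is a kind of") links.
LEXICON = {
    "chip":      {"pos": "noun", "hypernym": "piece"},
    "piece":     {"pos": "noun", "hypernym": "thing"},
    "chocolate": {"pos": "noun/adj", "hypernym": "food"},
}

def singularize(word):
    # Crude plural stripping, good enough for this sketch.
    return word[:-1] if word.endswith("s") else word

def parse_noun_phrase(phrase):
    """Read a phrase as a head noun modified by the preceding words."""
    *modifiers, head = [singularize(w) for w in phrase.lower().split()]
    if head not in LEXICON:
        return None  # out-of-vocabulary: the word-based system is stuck
    return {"head": head,
            "is_a": LEXICON[head]["hypernym"],
            "modifiers": [m for m in modifiers if m in LEXICON]}

print(parse_noun_phrase("chocolate chips"))
# {'head': 'chip', 'is_a': 'piece', 'modifiers': ['chocolate']}
print(parse_noun_phrase("pre-clinical pipeline"))  # None: jargon not in lexicon
```

The second call shows the failure mode discussed next: the dictionary bottoms out as soon as domain jargon appears.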
Unfortunately, few of the higher-level concepts people care about in real-world use cases are actually in the dictionary. The problem is easily seen wherever field-specific jargon is prevalent. For example, in a document containing clinical trial information, looking up 'drug', 'pipeline', and 'phase' individually in the dictionary will not help your word-based system realize that 'pre-clinical' is relevant to the concept at hand.
Trying to make a more comprehensive dictionary with more linkages between words and phrases and attempting to tie together anything potentially related in meaningful ways would lead to an intractable combinatorial explosion.
Context, together with some core knowledge to bootstrap from, are the crucial components of meaning that are missing when words are just treated as words. The alternative, when these are present, is a means to classify and represent concepts themselves. These can be as broad or specific as is appropriate, in any conceptual dimension, and, critically, independent of English or Chinese or the way something looks.
This is why even Google, formerly the biggest proponent of both word-based knowledge and relative-meaning technologies, just this past month started moving toward, and championing, a conceptual model of knowledge.
But how do we get computers to work with concepts? How do computers normally reason? Programs. Computers seem to understand programs much better than they do data.
PROCEDURAL KNOWLEDGE VS. DECLARATIVE CONCEPTS
People understand meaning through the process of interacting with their world. Stories, processes, and sensation are inextricably embedded in the substrate of that understanding.
The closest analog in the computing world might be the neural network of a robot that learns to improve its effectiveness at interacting with its environment over time. An intelligent agent like this robot can form its own internal representations relevant to its specific goals, shaped by both the input or sensory channels it receives and the ways it can respond to its environment.
An analogue in the world of unstructured data would be an interpreter that reads and establishes beliefs, changes the way it reads depending on those beliefs, and learns where to go to find what it wants.
Knowledge of this kind is called procedural knowledge: knowledge directly applicable to acting in the real world.
Unfortunately for our uses in natural language processing, text analytics, and workflow optimization, it is situated and tacit knowledge rather than explicit knowledge. It is job-dependent and agent-specific, and is much less general and much less reusable than declarative knowledge. When knowledge is developed as procedures in this way, rather than as explicit knowledge structures, it is siloed, and thus as a model it cannot scale well. There are too many non-portable assumptions and structural dependencies in place.
In contrast, with declarative knowledge, concepts are expressed as interpretable facts, sentences, or propositions, often associated with a context. The lack of procedural logic in them is a strength, allowing multiple types of algorithms to operate over the same explicit, declarative knowledge, potentially cooperating, even without coordination.
A metaphor in the relational database world comparing procedural to declarative knowledge would be an app that's heavily dependent on black-box stored procedures, versus a purely data-driven application.
Declarative knowledge is able to blend principles, conceptual models, causal networks, reasoning, and configurable hierarchical planning, and incrementally add knowledge from different but relevant contexts.
THE DECLARATIVE LANDSCAPE
Declarative knowledge includes the structure of information—its schema or domain model—in the way that people, subject matter experts, think of the landscape of what they know.
However, even declarative knowledge representation has its own schisms of sorts:
- more rule-oriented vs. more object-oriented approaches
- pure frame representations vs. richer-ontology semantic networks
- consistency-guaranteed vs. inconsistency-tolerant representations
- purely symbolic vs. probability-enhanced semantics.
Regardless of the schisms, the important thing is that these variations can still work together.
All of these representations are explicit, are openly accessible to logic, and are able to share concepts with each other. The better modern semantic knowledge integration platforms support all of these variations in tandem. The lingua franca of these different approaches and capabilities is the set of explicit, intensional, irreducible semantic concepts.
There are many types of logic, many types of rules, many types of reasoners, but when they utilize the same concept space they become compatible, providing some great leverage. With portable rules, new facts and even new rules can be deduced, or reasoned, by potentially multiple different reasoners with their own specializations.
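The idea of multiple reasoners cooperating without coordination can be sketched concretely. Below, two independent "reasoners" each understand only their own rule, but because both read and write the same declarative fact base of subject-predicate-object triples, each can build on facts the other derived. All entity names and rules here are hypothetical illustrations, not any particular platform's vocabulary.

```python
# Sketch: two independent rule-based reasoners sharing one declarative
# fact base of (subject, predicate, object) triples (hypothetical facts).
facts = {
    ("AcmeCorp", "is_a", "public_company"),
    ("AcmeCorp", "headquartered_in", "Boston"),
    ("Boston", "located_in", "Massachusetts"),
    ("Massachusetts", "located_in", "USA"),
}

def transitivity_reasoner(facts):
    """Rule 1: located_in is transitive."""
    derived = {(a, "located_in", c)
               for (a, p1, b) in facts if p1 == "located_in"
               for (b2, p2, c) in facts if p2 == "located_in" and b2 == b}
    return derived - facts  # only genuinely new facts

def propagation_reasoner(facts):
    """Rule 2: headquartered_in a place implies based_in its enclosing places."""
    derived = {(x, "based_in", q)
               for (x, p1, p) in facts if p1 == "headquartered_in"
               for (p2, r, q) in facts if r == "located_in" and p2 == p}
    return derived - facts

# Forward-chain both reasoners to a fixed point. Neither knows the
# other exists; the shared concept space does the coordinating.
new = True
while new:
    new = transitivity_reasoner(facts) | propagation_reasoner(facts)
    facts |= new

print(("AcmeCorp", "based_in", "USA") in facts)  # True, via both rules
```

The final fact requires both rules in sequence: transitivity derives that Boston is located in the USA, and only then can propagation conclude that AcmeCorp is based in the USA. Neither reasoner alone could have reached it.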
Because all of these varieties can live together, because the knowledge is structured in the way a subject matter expert thinks of things, and because the knowledge is so reusable, we can actually think of declarative knowledge as a whole as a single approach in its own right. In fact, declarative knowledge is the sweet spot between structure and flexibility in knowledge representation.
The framework for handling all of the above declarative variations, semantic network technology, is thus the minimum groundwork needed to raise information up to a level of meaning that breaks what we might call the meaning barrier: the point past which knowledge can easily be leveraged in ever better ways, rather than being trapped or overly lossy.
A declarative framework can also easily connect related contexts and provenance: for example, connecting an equity's calculated buy/sell recommendation to the target price that went into that calculation, to the research report it was stated in, to the author of that report, to that author's previous work history. The line between data and metadata drops away.
With this base, we can move on to more value-added considerations.
We've found that explicit meaning is needed for anything we'd want to call knowledge, concepts are much better than words for making things workable, and declarative knowledge opens up new dimensions of flexibility and reusability.
Next time in our third installment on unstructured data, we will directly address the convergence of unstructured and structured data, the union of text analytics and semantic technology, where core knowledge can come from, and how cooperation with little or no coordination empowers the next generation of horizontal big data.