Extracting Traceable Formal Representations from Natural Language Regulatory Documents

RTG Home | CIS Home | Penn Engineering | Penn

Regulatory documents, which include the vast bodies of legislation, operating procedures and organizational policy, are meant to be accessible to the people affected by them. Hence, they have to be in natural language (NL). On the other hand, regulations are expected to be consistent, and the governed entities/events are expected to conform to the regulation.

For example, the Food and Drug Administration's Code of Federal Regulations governs the bloodbanks in America. The bloodbanks perform safety-critical functions like the testing of blood for communicable disease agents (like HIV). It is highly desirable to determine whether:

The CFR is consistent, and
A bloodbank's implementation of such a function conforms to the CFR

We approach these problems under the broad framework of formal verification:

The regulation (CFR) is represented in a logic, and
The operations of an organization (bloodbank) are represented as abstract models
Theorem proving techniques are used to establish consistency
Model checking and/or runtime checking is used to determine conformance

A challenge in taking this approach is that regulatory bases like the CFR are large (about a million words) and complex. Translating the regulation to logic is a difficult task. In addition, for error diagnosis, we need to maintain a correspondence between the regulation in NL and logic. For example, if an organization does not conform to the regulation, a checker should produce the subset of the operations that violate a law, and the specific law that is violated.

Our goals are to design logics, and use natural language processing (NLP) techniques to aid in the translation of regulation to logic. We discuss the challenges in each of these areas below.

Designing a Logic for Regulation

An important constraint on the logic is that it should be possible to translate the regulation to logic one sentence at a time. In other words, for each NL setence there should be a corresponding logic statement which captures its interpretation as needed for conformance. What makes this challenging is that sentences in regulation refer to others. For example, the exceptions to an obligation are commonly given as long lists of permissions, and this makes the creation of a sentential representation in logic difficult. The need for imposing this constraint is motivated in our papers, and the advantages are as follows:

A correpondence between natural language and logic is obtained, which aids in error diagnosis.
The NLP techniques can focus on sentence-level processing, and
The expressive power of the logic can be evaluated. If we frequently need to combine NL sentences to translate them into logic, it would suggest that the logic needs to be augmented

Using NLP at the sentence-level

Our goal is to use NLP to assist a requirements engineer in translating the regulation to logic. The idea is to extract representations from the text, which a requirements engineer interacts with to produce the formal representation. There are two specific tasks, which we aim to assist in:

Ensuring consistency in predicate definitions
Creating the logic translation of a sentence

We discuss these tasks below.

A key problem in translating regulation to logic is one of vocabulary. Logics typically rely on atomic symbols -- propositions or predicates. The necessary inferences can succeed only if the predicates are consistently defined. For example, we need to infer that if a donation is not tested, then it is not tested for HIV. Suppose we created a predicate test(x), which is true iff x is tested, and a predicate testForHIV(x), which is true iff x is tested for HIV. The relationship between the two predicates would have to be axiomatized separately, e.g., testForHIV(x) => test(x). It is difficult to ensure that all the necessary axioms are provided. We address this problem by creating schemas from the text. A schema is a set of class and type definitions. Predicates, such as test(x) and testForHIV(x), are treated as abstract predicates, and a concrete definition is given using a logic which refers to the schema. Examples of a schema and predicate definitions can be found in our case study. Our goal is to use NLP to assist in extracting a schema.

To assist in creating a logic representations of a sentence, we construct a tree-like structure from the sentence. The leaves of the tree correspond to (abstract) predicates, and the representation of the sentence is built up from the leaves. The challenges are -- (a) computing the tree structure given the sentence, and (b) combining the representation of leaves into the representation of the sentence. We are currently manually annotating sentences with the tree structures. Once a modest-sized corpus is available, we will adapt parsing techniques to compute these representations. A screenshot of our annotation tool is available here.

RTG Home | CIS Home | Penn Engineering | Penn