Propositional Rules, Logical Decision Trees, and the Satisfiability Problem

Here is a question I received the other day: 

Can the propositional rules generated by logical decision trees be recast as a satisfiability problem, making decision trees NP-complete?

I’m going to break my answer down into segments.

What are Propositional Rules?

In order to understand what propositional rules are, it is important to understand what a proposition is. A proposition is the fundamental building block of logic. 

A proposition is a sentence that is either true or false, but not both. If a proposition is true, it is denoted by the capital letter T; if it is false, it is denoted by the capital letter F. Below are some examples of propositions:

  • The sky is blue. .... T
  • 10 + 10 = 20. .... T
  • The letter “p” is a vowel. .... F

Given two propositions p and q, we can create an “if p then q” statement, known as an implication. Implications are written as p → q. Here is an example of an implication:

  • p = The relative humidity is greater than 70%
  • q = It is raining tonight
  • p → q = “If the relative humidity is greater than 70% then it is raining tonight.”

As seen in the real-world example above, propositional rules can be created by combining propositions in different ways.
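To make the logic concrete, here is a minimal Python sketch (my own illustration, not part of the original question) showing that the implication p → q is logically equivalent to (NOT p) OR q, by printing its truth table:

```python
# A minimal sketch: the implication p -> q is logically equivalent to (not p) or q.
# The truth table shows it is only false when p is true and q is false.

def implies(p: bool, q: bool) -> bool:
    """Return the truth value of the implication p -> q."""
    return (not p) or q

for p in (True, False):
    for q in (True, False):
        print(f"p={p!s:5}  q={q!s:5}  p -> q = {implies(p, q)}")
```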

What is the Satisfiability Problem?

Suppose I have a propositional formula over a vector of Boolean variables x = (x1, x2, x3, x4, x5), where each variable can be assigned a value of T or F, for example x = (x1 = F, x2 = T, x3 = F, x4 = T, x5 = F). Can we find an assignment of values for these Boolean variables that makes the formula evaluate to true? That question is the satisfiability problem.

Consider, for example, the following formula:

  • F = x1 AND x2 AND x3 AND x4 AND x5

The formula F above evaluates to true when every variable in the vector x is assigned true. Therefore, the formula is satisfiable.
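As a rough illustration of what “finding an assignment” means, here is a small brute-force sketch in Python for the formula F above. It simply enumerates all 2^5 = 32 possible assignments; a real SAT solver would be far more sophisticated:

```python
from itertools import product

# A minimal sketch of a brute-force satisfiability check for
# F = x1 AND x2 AND x3 AND x4 AND x5. It enumerates all 2^5 = 32
# assignments and stops at the first one that makes F true.

def formula(x1, x2, x3, x4, x5):
    return x1 and x2 and x3 and x4 and x5

def find_satisfying_assignment():
    for assignment in product([False, True], repeat=5):
        if formula(*assignment):
            return assignment  # a satisfying assignment exists
    return None  # the formula is unsatisfiable

print(find_satisfying_assignment())  # (True, True, True, True, True)
```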

This satisfiability problem was proven to be NP-complete by Stephen A. Cook.

What does NP-Complete Mean?

Computational problems in the NP-complete class have no known efficient (polynomial-time) solution algorithms. You can beat your head against a wall trying to find a fast solution to these sorts of problems. However, given a candidate solution to an NP-complete problem, you can verify that solution in polynomial time (i.e. quickly). 

An example of an NP-complete problem is the Vertex Cover Problem, which some of you might have seen in an algorithms or data structures class.

Can the Propositional Rules Generated by Logical Decision Trees Be Recast as a Satisfiability Problem, Making Decision Trees NP-Complete?

Yes, they can. Once we have the propositional rules, we then need either to find a variable assignment that makes them satisfiable or to report that no such assignment exists.

Let’s suppose we are dealing with binary decision trees and we have three Boolean variables, a, b, and c. With n variables there are 2^n possible truth assignments that would need to be verified, so here that is 2^3 = 8 assignments.

[Figure: a binary decision tree over the variables a, b, and c]
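Here is a hedged sketch, assuming a small hypothetical tree over a, b, and c (not necessarily the one pictured), of how each root-to-leaf path becomes a propositional rule and how a brute-force search over the 2^3 assignments answers the satisfiability question:

```python
from itertools import product

# A hedged sketch: each root-to-leaf path of a small binary decision tree over
# the Boolean variables a, b, and c is written as a propositional rule (a
# conjunction of conditions), and we brute-force the 2^3 = 8 truth assignments
# to ask whether any assignment fires a rule that predicts the class "yes".

# Hypothetical rules extracted from a hypothetical tree:
rules = [
    ({"a": True,  "b": True},              "yes"),  # IF a AND b THEN yes
    ({"a": True,  "b": False, "c": True},  "yes"),  # IF a AND NOT b AND c THEN yes
    ({"a": True,  "b": False, "c": False}, "no"),   # IF a AND NOT b AND NOT c THEN no
    ({"a": False},                         "no"),   # IF NOT a THEN no
]

def satisfies(assignment, conditions):
    """Does this truth assignment meet every condition of a rule?"""
    return all(assignment[var] == value for var, value in conditions.items())

def find_assignment(target_class):
    """Search all 2^3 assignments for one that fires a rule predicting target_class."""
    for values in product([False, True], repeat=3):
        assignment = dict(zip("abc", values))
        for conditions, label in rules:
            if label == target_class and satisfies(assignment, conditions):
                return assignment
    return None  # no satisfying assignment exists

print(find_assignment("yes"))  # {'a': True, 'b': False, 'c': True}
```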

By the way, there is a famous 1976 paper by Hyafil and Rivest proving that constructing optimal binary decision trees is NP-complete.

References

Cook, S. A. and Mitchell, D. G. “Finding Hard Instances of the Satisfiability Problem: A Survey.” In Satisfiability Problem: Theory and Applications (Piscataway, NJ, 1996) (Ed. D. Du, J. Gu, and P. M. Pardalos). Providence, RI: Amer. Math. Soc., pp. 1-17, 1997.

My Master’s Thesis

Straight from the vault, my Master’s thesis: An Analysis of the Robustness and Relevance of Meteorological Triggers for Catastrophe Bonds.

Abstract

Each year weather-related catastrophes account for an estimated United States dollars (USD) $40 billion in damage across the world, although only a fraction of this risk of loss is insured. Losses from hurricanes in the United States have increased over the past several years to the extent that many insurance companies have become increasingly reluctant to insure in certain locations along the coast. Several insurance companies have become insolvent as a result of the active hurricane seasons of 2004 and 2005. In order to cope with this hurricane risk, some insurance and reinsurance firms have shifted part of their risk to the capital markets in the form of catastrophe bonds.

Two problems are observed with catastrophe bonds based on parametric triggers (e.g. Saffir-Simpson scale rating of a hurricane at landfall). First, the trigger mechanisms are measured imprecisely, with the degree of imprecision depending on the choice of trigger mechanism, the available sensor systems, and the methods by which meteorologists analyze the resulting observations. Second, the trigger mechanisms might not relate well to the economic harm caused by the weather phenomena, suggesting that they were not selected on the basis of adequate understanding of relevant meteorology and its relationship to storm damage. Both problems are documented, and perhaps ameliorated in part, by a thorough study of the relevant meteorology and meteorological practices.

Development of a set of robust and relevant triggers for catastrophe bonds for hurricanes is the objective of this study. The real-time and post-landfall accuracy of measured hurricane parameters such as minimum central pressure and maximum sustained surface wind speed were analyzed. Linear regression and neural networks were then employed in order to determine the predictability of storm damage from these measurable hurricane parameters or combinations thereof. The damage data set consisted of normalized economic losses for hurricane landfalls along the United States Gulf and Atlantic coasts from 1900 to 2005. The results reveal that single hurricane parameters and combinations of hurricane parameters can be poor indicators of the amount of storm damage.

The results suggest that modeled-loss type catastrophe bonds may be a potentially superior alternative to parametric-type bonds, which are highly sensitive to the accuracy of the measurements of the underlying storm parameters and to the coastal bathymetry, topography, and economic exposure. A procedure for determining the robustness of a risk model for use in modeled-loss type catastrophe bonds is also presented.

How to Determine What Machine Learning Model to Use

With all of the different machine learning approaches out there (unsupervised learning, supervised learning, and reinforcement learning), and the many models within each, how do you go about deciding which model to use for a particular problem?

One approach is to try every possible machine learning model and then examine which model yields the best results. The problem with this approach is that it could take a VERY long time. There are dozens of machine learning algorithms, and each one has a different run time. Depending on the data set, some algorithms may take hours or even days to complete.

Another risk of the “try-all-models” approach is that you might end up using a machine learning algorithm on a type of problem that is not really a good fit for that particular algorithm. It would be like using a hammer to tighten a screw. Sure, a hammer is a useful tool, but only when used for its intended purpose. If you want to tighten a screw, use a screwdriver, not a hammer.

When deciding on what type of machine learning algorithm to use, you have to first understand the problem thoroughly and then decide what you want to achieve. Here is a helpful framework that can be used for algorithm selection:

Are you trying to divide an unlabeled data set into groups such that each group has similar attributes (e.g. customer segmentation)?

If yes, use a clustering algorithm (unsupervised learning) like k-means, hierarchical clustering, or Gaussian Mixture Models.
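As a quick illustration, here is a minimal scikit-learn sketch of customer segmentation with k-means; the two-feature data set (annual spend, visits per month) is made up purely for this example:

```python
import numpy as np
from sklearn.cluster import KMeans

# A minimal sketch of customer segmentation with k-means (unsupervised learning).
# The two-column data (annual spend, visits per month) is invented for illustration.
X = np.array([
    [200,  2], [220,  3], [250,  2],   # low-spend, infrequent customers
    [900, 12], [950, 14], [880, 11],   # high-spend, frequent customers
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)                   # cluster index for each customer, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # the centroid of each segment
```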

Are you trying to predict a continuous value given a set of attributes (e.g. house price prediction)?

If yes, use a regression algorithm (supervised learning), like linear regression.
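For example, here is a minimal house-price sketch with scikit-learn’s LinearRegression, using made-up (square footage, bedrooms) → price data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# A minimal sketch of house price prediction with linear regression (supervised
# learning). The (square footage, bedrooms) -> price data is invented for illustration.
X = np.array([[1000, 2], [1500, 3], [2000, 3], [2500, 4]])
y = np.array([200_000, 270_000, 330_000, 400_000])

model = LinearRegression().fit(X, y)
print(model.predict(np.array([[1800, 3]])))  # predicted price for an unseen house
```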

Are you trying to predict discrete classes (e.g. spam/not spam)? Do you have a data set that is already labeled with the classes?

If yes to both questions, use a classification algorithm (supervised learning) like Naive Bayes, K-Nearest Neighbors, logistic regression, ID3, neural networks, or support vector machines.
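For example, here is a minimal spam/not-spam sketch using a Naive Bayes classifier from scikit-learn; the tiny labeled corpus is invented purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# A minimal sketch of spam/not-spam classification with Naive Bayes (supervised
# learning) on a tiny made-up labeled corpus.
messages = [
    "win a free prize now", "limited offer click here",                # spam
    "are we still meeting tomorrow", "here are the notes from class",  # not spam
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)   # bag-of-words features
classifier = MultinomialNB().fit(X, labels)

test = vectorizer.transform(["free prize offer"])
print(classifier.predict(test))          # likely ['spam']
```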

Are you trying to reduce a large number of attributes to a smaller number of attributes?

If yes, use a dimensionality reduction algorithm, like stepwise forward selection or principal component analysis.
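For example, here is a minimal principal component analysis sketch with scikit-learn, reducing five made-up attributes down to two:

```python
import numpy as np
from sklearn.decomposition import PCA

# A minimal sketch of dimensionality reduction with principal component analysis.
# The 5-attribute data set is random and purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 instances, 5 attributes

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)         # same 100 instances, now 2 attributes
print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component
```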

Do you need an algorithm that reacts to its environment, continuously learning from experience, the way humans do (e.g. autonomous vehicles and robots)?

If yes, use reinforcement learning methods.
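As a toy illustration of the learn-from-experience idea, here is a minimal tabular Q-learning sketch on a made-up five-state corridor; real reinforcement learning problems (autonomous vehicles, robots) are of course vastly more complex:

```python
import random

# A minimal sketch of tabular Q-learning on a toy 5-state corridor, purely for
# illustration: the agent starts in state 0 and receives a reward of 1 for
# reaching state 4. Actions: 0 = move left, 1 = move right.
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.5, 0.9, 0.1    # learning rate, discount, exploration rate
Q = [[0.0] * n_actions for _ in range(n_states)]

def choose_action(state):
    # Epsilon-greedy selection; ties (e.g. an unvisited state) are broken randomly.
    if random.random() < epsilon or Q[state][0] == Q[state][1]:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[state][a])

for episode in range(500):
    state = 0
    while state != n_states - 1:
        action = choose_action(state)
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Q-learning update: move the estimate toward reward + discounted best future value.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

# The learned greedy policy should be "always move right" in every non-terminal state.
print([max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states - 1)])  # [1, 1, 1, 1]
```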

For each of the questions above, you can ask follow-up questions to home in on the appropriate algorithm for that type of problem.

For example:

  • Do we need an algorithm that can be built, trained, and tested quickly?
  • Do we need a model that can make fast predictions?
  • How accurate does the model need to be?
  • Is the number of attributes greater than the number of instances?
  • Do we need a model that is easy to interpret? 
  • How scalable a model do we need?
  • Which evaluation criteria are important in order to meet business needs?
  • How much data preprocessing do we want to do?

Here is a really useful flowchart from Microsoft that helps you decide which algorithm to use when:

[Figure: Microsoft’s machine learning algorithm decision chart. Source: Microsoft]

Here is another useful flowchart from scikit-learn.

[Figure: scikit-learn’s algorithm selection flowchart]

Slide 11 of this link shows the interpretability vs. accuracy tradeoffs for the different machine learning models.

This link provides a quick rundown of the different types of machine learning models.