In a previous post, I explored some of the overlap between writing code and patent analysis queries. Looking more closely at the analysis side of patent analytics, let’s explore why some machine learning methods meet resistance from patent professionals.
Howe, introduced previously, states that data preparation can take up to 90% of the time spent performing an analysis. Some in data science refer to this cleaning, culling and evaluation of initial results as data munging or data wrangling. The skill set of patent professionals can lead to a lack of uptake or appreciation of some analytics tools, although supervised machine learning could help them, especially when it is applied to the time-consuming data preparation steps.
As an example, let’s look at a method for developing a decision tree, which allows for a simple yes/no choice between two items. We can compare the process of results evaluation in patent analysis to a proof satisfiability tree (SAT), an example of which is provided by Genesereth and Kao. Mike Genesereth and Eric Kao are instructors who conduct research on logic at Stanford’s Computer Science Department. The SAT method checks whether a sentence is satisfiable, with the parts of the sentence here labeled P, Q and R, and the 0 and 1 showing whether each part is false or true (Figure 1). A small number of combinations could easily be explored manually, but with a large number of combinations this becomes infeasible. When checking satisfiability part by part, if an upper node can be shown to be unsatisfied, everything beneath it on that path no longer needs to be evaluated, as shown by the grayed-out area in the figure. Not having to evaluate the grayed areas saves time.
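The pruning idea can be sketched in a few lines of Python. This is a minimal illustration, not the example from Genesereth and Kao: the sentence `(P or Q) and (not P or R) and (not Q)`, the `SENTENCE` encoding and the function names are all assumptions made up for this sketch. The search assigns P, then Q, then R, and abandons a branch as soon as one clause can no longer be satisfied.

```python
from typing import Dict, List, Optional, Tuple

Literal = Tuple[str, bool]   # (variable, truth value the literal requires)
Clause = List[Literal]       # a clause is an OR of literals

# Hypothetical sentence: (P or Q) and (not P or R) and (not Q)
SENTENCE: List[Clause] = [
    [("P", True), ("Q", True)],
    [("P", False), ("R", True)],
    [("Q", False)],
]
VARIABLES = ["P", "Q", "R"]


def clause_falsified(clause: Clause, assignment: Dict[str, bool]) -> bool:
    """True when every literal in the clause is assigned and none holds."""
    return all(var in assignment and assignment[var] != want
               for var, want in clause)


def search(assignment: Dict[str, bool], depth: int,
           stats: Dict[str, int]) -> Optional[Dict[str, bool]]:
    """Walk the truth tree top-down, skipping subtrees that cannot succeed."""
    stats["visited"] += 1
    if any(clause_falsified(c, assignment) for c in SENTENCE):
        stats["pruned"] += 1   # everything below this node is "grayed out"
        return None
    if depth == len(VARIABLES):
        return dict(assignment)   # all parts assigned, no clause falsified
    var = VARIABLES[depth]
    for value in (True, False):
        assignment[var] = value
        found = search(assignment, depth + 1, stats)
        del assignment[var]
        if found is not None:
            return found
    return None


stats = {"visited": 0, "pruned": 0}
model = search({}, 0, stats)
print(model, stats)
```

For this made-up sentence, the branch with P and Q both true is cut off at Q, so R is never examined beneath it. A full tree over three variables has 15 nodes, and the sketch visits only part of it before finding a satisfying assignment.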
In order to apply this technique to patent analytics, I will refer to my previous post where I mentioned a query from Jurafsky’s Natural Language Processing:
In children with acute febrile illness, what is the efficacy of single-medication therapy with acetaminophen or ibuprofen in reducing fever?
I chose this example because it was introduced as a type of query that machine learning methods still have difficulty with, yet it resembles the queries commonly seen in patent searching. Assigning generic variables, along the lines of the SAT tree example, would produce the following:
R = reducing fever
While a searcher starts out very broad and uses more precision to narrow the query, the narrowness of the search is sometimes inversely proportional to the breadth of the query. Here, P, Q and R would represent the query elements, and the 0 and 1 would indicate whether each query element matched the record. In theory, this aids evaluation: when a higher level is knocked out, the analyst doesn’t have to look further down into the shaded areas.
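As a rough sketch of how that evaluation might look in code, the snippet below checks P, Q and R in order against a record and stops at the first miss. Everything in it is assumed for illustration: the keyword lists, the choice of P as the febrile illness and Q as the two compounds, the `evaluate` function and the three sample records. Plain substring matching stands in for whatever matching a real search tool would use.

```python
from typing import Dict, List

# Hypothetical query elements; only "reducing fever" comes from the text above.
ELEMENTS: Dict[str, List[str]] = {
    "P": ["febrile"],                                   # assumed: the illness
    "Q": ["acetaminophen", "ibuprofen"],                # assumed: the compounds
    "R": ["reducing fever", "fever reduction", "antipyretic"],
}


def evaluate(text: str) -> Dict[str, int]:
    """Check P, then Q, then R; stop at the first element that misses."""
    lowered = text.lower()
    path: Dict[str, int] = {}
    for name, terms in ELEMENTS.items():
        hit = any(term in lowered for term in terms)
        path[name] = 1 if hit else 0
        if not hit:
            break   # lower levels are not evaluated for this record
    return path


# Made-up records, not real patent data.
records = {
    "A": "Ibuprofen suspension for reducing fever in children with acute febrile illness.",
    "B": "Acetaminophen tablet with improved dissolution for fever reduction.",
    "C": "Cooling blanket for children with acute febrile illness.",
}

for key, text in records.items():
    print(key, evaluate(text))
```

Record A passes all three levels, C is knocked out at Q, and B is knocked out at P without Q or R ever being checked. B still mentions a compound and the indication, so a strict top-down filter saves effort at the cost of possibly discarding records an analyst would want to see.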
A similar P, Q and R approach could apply to using filters and an annotator in Clearstone Innovator to move through a hierarchy, narrowing down to the remaining patents that are likely of interest. As mentioned in the Clearstone Innovator post, this approach would not be applicable to all patent search types.
A high-level SAT model holds for a straightforward example of P, Q and R, but patent analysts don’t always deal with straightforward issues. Patent analysts can be resistant to such tools because of the following considerations:
- While P, Q and R may all appear in the same record, there can also be valid answers that lack P but have Q and R, such as in 103 (obviousness) issues.
- Some cases deal with relevancy. This possibly extends beyond ranking methods such as EdgeRank and PageRank. Such rankings are sometimes based on content designed for sharing, not on patent data, which I have seen referred to as “purposefully obfuscated”.
- There are always concerns about true and false positives and negatives. In patent searching, it would not be uncommon to see P in a laundry list of diseases, Q in a laundry list of compounds, R in a laundry list of indications, Q covered by a class of compounds and not listed specifically, various combinations of such P/Q/R, etc.
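To make the first and last considerations concrete, here is a small sketch that counts true and false positives and negatives for two filtering rules. The five records, their 0/1 match flags and the analyst judgments are invented for illustration, and the `strict` and `relaxed` rules are assumptions of this sketch, not methods from any tool mentioned here.

```python
from typing import Callable, Dict, List, Tuple

Flags = Tuple[int, int, int]   # 0/1 match for P, Q, R

# Hypothetical records: (match flags, did the analyst judge it relevant?)
RECORDS: List[Tuple[Flags, bool]] = [
    ((1, 1, 1), True),    # clean hit
    ((0, 1, 1), True),    # relevant without P, e.g. a 103-style reference
    ((1, 1, 1), False),   # terms only appear in laundry lists
    ((1, 0, 0), False),
    ((0, 0, 1), False),
]


def strict(flags: Flags) -> bool:
    """Keep a record only when P, Q and R all match."""
    return all(flags)


def relaxed(flags: Flags) -> bool:
    """Keep a record when at least two of the three elements match."""
    return sum(flags) >= 2


def confusion(rule: Callable[[Flags], bool]) -> Dict[str, int]:
    """Count TP/FP/FN/TN for a rule against the analyst judgments."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for flags, relevant in RECORDS:
        predicted = rule(flags)
        if predicted and relevant:
            counts["TP"] += 1
        elif predicted and not relevant:
            counts["FP"] += 1
        elif relevant:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


print("strict ", confusion(strict))
print("relaxed", confusion(relaxed))
```

On this toy data the strict rule misses the record that lacks P, while the relaxed rule recovers it. Neither rule removes the laundry-list false positive, which still needs an analyst’s judgment.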
Just as there is not always a single strategy for every search type, there is not always a single analytical technique or tool to accomplish a search. While the mocked example here covers only some types of patent searching, automated processes can still be explored beyond it. As seen in this post, along with several others on patinformatics.com pertaining to machine learning, there are aspects of patent analytics that can benefit from machine learning.