It is one of those old statistical measures that most of us take for granted: 80% of the information in patents is never published anywhere else. Information professionals have been saying this for years and, for the most part, many of us in the patent information profession have simply taken this as a true statement. In fact, this is one of those statistics that people have used for so long that, if questioned, most aren’t even sure where it originated or what proof there is for it. Such was the case when this question was raised recently on the Patent Information User Group’s (PIUG) discussion forum. The post elicited a number of relevant comments, and is worth reading for historical perspective on the statistic.
It is generally accepted that the source of this saying is the “Eighth Technology Assessment and Forecast Report” from the USPTO, published in 1977, but as pointed out in the comments of the PIUG post, this study was done with a very small amount of data. A statistic like this is also likely to be technology dependent, with some areas more inclined than others to publish only via patents. The PIUG thread also includes comments discussing more recent research on this question, published in 2005, that used chemical information. These studies reached conclusions similar to the original statistic, although the results did vary depending on the chemical sub-discipline studied.
One of the threads that run through all of the attempts to answer this question is the use of chemical information to study the issue. This is likely due to the existence of data from the American Chemical Society’s Chemical Abstracts Service (CAS), an organization that does a pretty comprehensive job of capturing discrete chemical entities from both patent and non-patent literature. Specific chemical substances are only one type of potential “technology”, but considering how difficult it generally is to capture, and subsequently search for, other types of technology, it makes sense to use something compartmentalized, like chemical substances, to look at a question like this one.
CAS has been saying for many years now that more than 70% of the new substances added to the CAS Registry from the literature come from patents. This statement is interesting in and of itself, and while it doesn’t directly answer the question associated with the oft-quoted statistic, it is a relevant piece of information, since the majority of substances in the file have only one publication associated with them.
So while we have some evidence that, at least for the chemical sciences, this statement about patent publications is likely true, is there a way to study the issue more definitively? The previous studies have always had to settle for small sample sizes, or make certain assumptions, due to the sheer volume of data associated with chemical compounds and the limitations associated with analyzing them. Thinking about this, it occurred to me that while that used to be the case, we now have a powerful tool for studying large amounts of chemical information. I talked about this tool in a previous post when I provided a first look at the New STN system. In that post I talked about the idea of Big Data, and how the people behind New STN were taking advantage of recent advances in data analytics to bring the benefits of big data to the world of chemical information. The question of what percentage of chemical information described in patents is ever discussed anywhere else seems like an ideal example of the sort of question a big data solution for chemical information could answer.
One of the features of New STN is the ability to transfer information quickly between multiple files within the system. In this case I am interested in extracting substance information from the CAplus file, which is the database where literature references, both patents and non-patents, are stored, and transferring that data into the Registry file, where the substances are kept. I am also interested in finding all references associated with chemical substances once I have identified them. The commands to do this on New STN are subx, for extracting the substances, and refx, for finding the references associated with them. Using these commands I was quickly able to come up with a comprehensive answer to the publication-in-patents-only question using the world’s largest collection of chemical information.
I started by entering both the Registry and CAplus files simultaneously on New STN and ran a search for patents as a document type in CAplus. This produced 9,543,607 patent references in the database. Extracting the substances from these references produced a collection of 49,058,846 compounds in the Registry database. These numbers are pretty staggering and, as was pointed out in the previous post on New STN, would not have been possible to produce under the system limits imposed by the previous versions of STN. See the image below for a look at some of the most recent substances from this collection:
Crossing all of these substances back into CAplus generates 24,824,536 literature references, both patent and non-patent, associated with these more than 49 million substances. Of these nearly 25 million references, 19,190,577 are not patents. I can now extract the substances from just the non-patent literature references, bring them back to Registry, and compare them to my originally extracted collection of just over 49 million patented substances.
When I did this I found that the 19 million non-patent literature references generated 35,654,723 substances themselves. Already, this is a smaller number than the 49 million we started with, but the real question is what happens when this collection is NOTed out of the original collection of patented substances. What we find is that 46,449,600 substances remain when the substances associated with the non-patent literature references are removed from the starting collection of patented substances.
This means that roughly 95% (46,449,600 of 49,058,846) of the substances coming from the patent collection on CAplus did not have a corresponding non-patent literature reference associated with them.
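As a quick check on the arithmetic, the short Python sketch below recomputes the share from the two Registry counts quoted in this post. The variable names are my own and are only there for illustration; the counts are the ones reported above.

```python
# Answer-set sizes reported in this post.
patent_substances = 49_058_846       # substances extracted from the patent references
patent_only_substances = 46_449_600  # left after removing substances also found in non-patent literature

# Substances that appear in both patents and non-patent literature.
also_in_non_patent = patent_substances - patent_only_substances

# Fraction of patented substances with no non-patent literature reference.
patent_only_share = patent_only_substances / patent_substances

print(f"Also in non-patent literature: {also_in_non_patent:,}")
print(f"Patent-only share: {patent_only_share:.1%}")
```

The share works out to about 94.7%, which rounds to the 95% figure, and it leaves roughly 2.6 million patented substances that also show up somewhere in the non-patent literature.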
The series of search commands I followed for this are below:
CAplus: 9,543,607 (Patents in CAplus)
L3 subx L2
CAplus: - REGISTRY: 49,058,846 (Substances associated with the patents)
L4 refx L3
CAplus: 24,824,536 REGISTRY: - (All references associated with the patented substances)
L5 L4 not p/dt
CAplus: 19,190,577 REGISTRY: - (Just the non-patent references)
L6 subx L5
CAplus: - REGISTRY: 35,654,723 (Substances extracted from the non-patent literature references)
L7 L3 not L6
CAplus: - REGISTRY: 46,449,600 (Substances found in the patents but not in the non-patent literature references)
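For readers who don't use STN, the logic of this session can be sketched with ordinary Python sets on a toy dataset. This is only an illustration of the set operations involved, not how New STN works internally: the `CAPLUS` dictionary, the reference ids, the substance labels, and the Python functions named `subx` and `refx` are all invented here to mirror the commands above.

```python
# Toy stand-in for CAplus: reference id -> (document type, substances indexed in it).
# Every id and substance label below is invented for illustration.
CAPLUS = {
    "ref1": ("patent", {"A", "B", "C"}),
    "ref2": ("patent", {"C", "D"}),
    "ref3": ("journal", {"C", "E"}),
    "ref4": ("journal", {"F"}),
}


def subx(ref_ids):
    """Collect every substance indexed in the given references."""
    substances = set()
    for ref_id in ref_ids:
        substances |= CAPLUS[ref_id][1]
    return substances


def refx(substances):
    """Find every reference that indexes at least one of the given substances."""
    return {ref_id for ref_id, (_, subs) in CAPLUS.items() if subs & substances}


l2 = {r for r, (doc_type, _) in CAPLUS.items() if doc_type == "patent"}  # patents
l3 = subx(l2)                                        # substances in the patents
l4 = refx(l3)                                        # all references to those substances
l5 = {r for r in l4 if CAPLUS[r][0] != "patent"}     # just the non-patent references
l6 = subx(l5)                                        # substances in those non-patent references
l7 = l3 - l6                                         # patented substances never seen outside patents

print(sorted(l7), f"{len(l7) / len(l3):.0%}")
```

Note that `l6` can contain substances that were never in a patent at all (substance "E" in the toy data), because a non-patent reference that mentions one patented substance usually indexes other substances too. That is why the size of the non-patent substance set on its own says little, and why the final NOT step is the one that matters.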
Chemical substances account for a meaningful share of the technologies covered by the world’s patenting authorities, and thus make a reasonable collection for studying how often technologies mentioned in patents are never mentioned elsewhere. In this study the substances associated with every patent included in the CAS literature database were extracted. These discrete chemical entities were then searched, and all literature references associated with them were retrieved. From these the patent documents were excluded, and again the substances were extracted. A comparison of the substances coming from the non-patent literature references to those coming from the patent references showed that 95% of the patented substances did not appear in the non-patent literature references. Once again, this example only covered chemical technologies, so the result cannot be assumed to apply to other technology areas, but in this case the percentage of information found in patents that is never published elsewhere is actually significantly larger than the 80% value that has been bandied about for more than 30 years.