One of the most polarizing collections of tasks associated with patent analytics is the use of machine learning methods for organizing and prioritizing documents. While these methods have caught on and are used in many industries, adoption in the patent information space has been sporadic. Opponents are concerned about the peculiarities of the language used within patent documents and how these methods can deal with its inherent ambiguities, while proponents see a potential tool for assisting with time-consuming review tasks, even if the methods aren't fully automatic. Regardless of an individual's perspective on their value, though, there is little doubt that significant attention is being paid to them, and it is in the best interest of all patent practitioners to have a basic understanding of how these methods work and how they are being applied to patents. This post will provide some background on machine learning methods and how they apply to the tasks of clustering, classification, and spatial concept mapping. It is the first in a series on machine learning methods for patent analytics; additional posts in this series will focus on each task individually and provide practical tips on applying it to the analysis of patent documents.
Wikipedia provides the following definition of machine learning:
Machine learning, a branch of Statistical Learning, is about the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders.
The core of machine learning deals with representation and generalization. Representation of data instances and functions evaluated on these instances are part of all machine learning systems. Generalization is the property that the system will perform well on unseen data instances; the conditions under which this can be guaranteed are a key object of study in the subfield of computational learning theory also referred to as statistical learning theory.
Continuing with some additional definitions, the terms clustering and classification are often used interchangeably but are actually quite different from one another. Clustering is normally associated with unsupervised methods of organizing a document collection based on similarity comparisons between its members. With a fixed number of clusters identified at the outset, documents that meet a similarity threshold are grouped together. Ideally, the documents within a cluster should be similar to one another but dissimilar to the documents in other clusters.
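To make the idea concrete, here is a minimal sketch of unsupervised clustering using scikit-learn, with a handful of invented abstracts standing in for patent documents; the sample text and the choice of two clusters are purely illustrative.

```python
# A minimal clustering sketch (assumes scikit-learn is installed).
# The example abstracts and the choice of two clusters are purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

abstracts = [
    "A lithium-ion battery electrode with a silicon coating",
    "An anode material comprising silicon nanoparticles",
    "A convolutional neural network for image recognition",
    "A method of training a deep learning model on images",
]

# Represent each document as a TF-IDF vector.
vectors = TfidfVectorizer(stop_words="english").fit_transform(abstracts)

# Fix the number of clusters up front, as k-means requires.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(kmeans.labels_)  # similar documents should share a cluster label
```

In practice an analyst would vectorize thousands of documents rather than four, and choosing the number of clusters becomes an important tuning decision.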
Classification, on the other hand, is usually accomplished with a supervised machine learning method that uses "learning sets" to identify the key attributes of documents in a category. The learning sets are small sub-collections, one for each category, generated by the analyst, who decides which documents should appear in each class. New documents are compared to the learning sets and assigned to a class based on their similarity to the documents already placed in that category.
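The learning-set idea can be sketched with a simple nearest-centroid classifier: the analyst's labeled examples define each category, and a new document is assigned to the category whose examples it most resembles. The training texts and category labels below are invented for illustration, and a real tool may use a more sophisticated algorithm.

```python
# A minimal supervised sketch (assumes scikit-learn is installed).
# The training texts and labels stand in for analyst-built learning sets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid

learning_set = [
    ("A battery electrode with improved capacity", "batteries"),
    ("An electrolyte additive for lithium cells", "batteries"),
    ("A neural network for classifying images", "machine_learning"),
    ("Training data augmentation for deep models", "machine_learning"),
]
texts, labels = zip(*learning_set)

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# Each category is summarized by the centroid of its learning-set vectors.
clf = NearestCentroid().fit(X.toarray(), labels)

new_doc = ["A silicon anode for rechargeable lithium batteries"]
print(clf.predict(vectorizer.transform(new_doc).toarray()))  # likely 'batteries'
```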
Spatial concept mapping is related to clustering and classification, since it generally begins with one of those methods, but it adds an extra component to the task: identifying the relative similarity between the categories created. The tools involved take the document clusters or classes and arrange them in two-dimensional space by considering the similarity of the documents, or clusters, relative to one another across the entire collection. Documents that share elements in common are placed closer together spatially, while less similar ones are placed farther apart.
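As a rough sketch of how such a layout can be produced, the snippet below projects pairwise document distances into two dimensions with multidimensional scaling; this is a generic illustration on invented text, not the algorithm used by any particular mapping tool.

```python
# A minimal 2-D layout sketch (assumes scikit-learn is installed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances
from sklearn.manifold import MDS

docs = [
    "lithium battery electrode silicon anode",
    "lithium cell electrolyte additive",
    "neural network image recognition",
    "deep neural network training",
]

# Pairwise cosine distances capture how dissimilar each pair of documents is.
distances = cosine_distances(TfidfVectorizer().fit_transform(docs))

# MDS places documents in 2-D so that similar documents land close together.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(distances)
print(coords)  # one (x, y) position per document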
Now that the tasks associated with machine learning methods have been identified, let's look at some of the algorithms used to perform them. Knowing a little about these will help analysts understand and evaluate the tools they decide to use.
When it comes to clustering, the unsupervised machine learning task, the two algorithms most often used in patent analysis tools are k-means and force-directed placement.
The two methods are unsupervised, so they are both referred to as clustering, but they take very different approaches to grouping documents into categories. K-means creates a fixed number of clusters and assigns each new document to the cluster whose existing documents it most resembles. Force-directed placement doesn't generate clusters per se, but instead looks for a "local" energy minimum where additional perturbation would increase the tension in the collection. Chemists can relate to this method, since it resembles the electrostatic and steric forces that lead to the most favored conformations in small-molecule 3-D modeling and protein folding.
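As a rough illustration of the force-directed idea, the sketch below builds a similarity graph over a few invented documents and lets networkx's spring layout (a Fruchterman-Reingold force-directed algorithm) relax it toward a low-energy arrangement; the similarity threshold and sample text are arbitrary choices for the example.

```python
# A minimal force-directed placement sketch (assumes networkx and scikit-learn).
# Documents become nodes; edges connect sufficiently similar pairs.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "lithium battery electrode silicon anode",
    "lithium cell electrolyte additive",
    "neural network image recognition",
    "deep neural network training",
]
sims = cosine_similarity(TfidfVectorizer().fit_transform(docs))

graph = nx.Graph()
graph.add_nodes_from(range(len(docs)))
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if sims[i, j] > 0.1:  # illustrative similarity threshold
            graph.add_edge(i, j, weight=sims[i, j])

# spring_layout iteratively relaxes the graph toward a low-energy arrangement.
positions = nx.spring_layout(graph, weight="weight", seed=42)
print(positions)  # node -> (x, y) once the layout settles
```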
Readers are encouraged to explore the links provided for each algorithm if they are interested in additional details or the math behind the operations.
Moving to classification, the supervised machine learning task, two frequently applied algorithms are Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs).
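As a minimal sketch of the supervised workflow, the snippet below trains a linear SVM on a tiny invented learning set using scikit-learn; swapping LinearSVC for sklearn.neural_network.MLPClassifier would give a simple ANN variant of the same pipeline.

```python
# A minimal SVM classification sketch (assumes scikit-learn is installed).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "A battery electrode with improved capacity",
    "An electrolyte additive for lithium cells",
    "A neural network for classifying images",
    "Training data augmentation for deep models",
]
train_labels = ["batteries", "batteries", "machine_learning", "machine_learning"]

# A linear SVM learns a separating boundary between the categories.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
model.fit(train_texts, train_labels)

print(model.predict(["A silicon anode for rechargeable lithium batteries"]))
```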
As applied to patent analytics, the most frequently used sources of content for both clustering and classification exercises are patent classification codes and the raw, or standardized, text from a source document.
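One common way to combine the two sources, sketched below under the assumption that scikit-learn and SciPy are available, is to one-hot encode the classification codes and stack them alongside TF-IDF text vectors; the sample codes and text are invented for illustration.

```python
# A minimal feature-combination sketch (assumes scikit-learn and SciPy).
# Classification codes are one-hot encoded and stacked next to TF-IDF vectors.
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

documents = [
    {"text": "A silicon anode for lithium batteries", "codes": ["H01M4/38"]},
    {"text": "A neural network for image recognition", "codes": ["G06N3/08", "G06V10/82"]},
]

texts = [d["text"] for d in documents]
codes = [d["codes"] for d in documents]

text_vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
code_vectors = csr_matrix(MultiLabelBinarizer().fit_transform(codes))

# The combined matrix can feed either a clustering or a classification step.
features = hstack([text_vectors, code_vectors])
print(features.shape)  # (documents, text terms + distinct codes)
```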
Looking at spatial concept maps, the FAQ section for the IN-SPIRE tool, a close cousin of ThemeScape (both originally developed at Pacific Northwest National Laboratory), provides the following explanation of the process used to create a spatial map starting from a clustering step:
In brief, IN-SPIRE™ creates mathematical representations of the documents, which are then organized into clusters and visualized into “maps” that can be interrogated for analysis.
More specifically, IN-SPIRE™ performs the following steps:
Spatial concept maps can also be made using classification methods. Arguably, the most famous of these is the Kohonen Self Organizing Map (SOM):
Kohonen Self Organizing Maps – a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. Self-organizing maps are different from other artificial neural networks in the sense that they use a neighborhood function to preserve the topological properties of the input space.
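To give a feel for the mechanics, here is a toy SOM written in plain NumPy: each cell of a small two-dimensional grid holds a prototype vector, and training repeatedly pulls the best-matching cell and its neighbors toward each input. The grid size, learning rate, and random data are arbitrary choices for illustration, not settings from any patent analysis tool.

```python
# A toy self-organizing map in NumPy; all parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((100, 5))                 # 100 "documents" as 5-D vectors
grid_w, grid_h, dim = 6, 6, data.shape[1]
weights = rng.random((grid_w, grid_h, dim)) # one prototype vector per map cell

# Grid coordinates, used by the neighborhood function.
gx, gy = np.meshgrid(np.arange(grid_w), np.arange(grid_h), indexing="ij")

for step, x in enumerate(data[rng.integers(0, len(data), 2000)]):
    lr = 0.5 * np.exp(-step / 1000)         # decaying learning rate
    radius = 3.0 * np.exp(-step / 1000)     # shrinking neighborhood
    # Best-matching unit: the cell whose prototype is closest to x.
    dists = np.linalg.norm(weights - x, axis=2)
    bi, bj = np.unravel_index(np.argmin(dists), dists.shape)
    # A Gaussian neighborhood pulls nearby cells toward x as well.
    grid_dist = (gx - bi) ** 2 + (gy - bj) ** 2
    influence = np.exp(-grid_dist / (2 * radius ** 2))
    weights += lr * influence[..., None] * (x - weights)

# After training, each document maps to its best-matching cell on the 2-D grid.
cells = [np.unravel_index(np.argmin(np.linalg.norm(weights - v, axis=2)),
                          (grid_w, grid_h)) for v in data]
print(cells[:5])
```

The neighborhood update is what preserves topology: documents that are similar in the original vector space end up in nearby cells on the map.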
Machine learning methods provide organization and prioritization functions that can be applied to patent documents and, if used properly, can deliver great value to analysts. This post has provided an introduction to the variety of tasks associated with machine learning methods in patent analytics and distinguished them from one another. In future posts, each of the three primary tasks (clustering, classification, and spatial concept mapping) will be covered in detail, using tools designed for the analysis of patent documents, by way of a relevant case study.