Expanding Patent Families and Other Multi-Value Fields with Google Refine (Open Refine)

In the previous post Google Refine (soon to be called Open Refine and henceforth referred to as simply Refine in the post) was used to clean up Patent Assignee data using the clustering algorithms built into the tool. It was surprising to hear that most readers hadn’t heard of Refine previously and certainly hadn’t applied it to the manipulation of patent data. There was even a question in the comments section of that post on whether the tool could be used to split multiple values in a cell into individual rows as opposed to columns. Continuing with the discussion on using Refine with patent data let’s look at a situation where it can be used to process the cells in several columns this way and how the operations can be coded so they can be repeated again with a different data set in the same format.

In this case a small collection of data for twenty-five older US granted patents from Nike was downloaded from Orbit.com. The data was exported in CSV (comma separated value) format but, as discussed previously, could have been downloaded in XML, XLS or a variety of other formats and still been usable in Refine. While Orbit.com was used in this particular example, most major patent information systems provide an export feature, and the methods described can be used anytime a user needs to separate multiple values in cells into individual rows.

Once the data has been exported in CSV format, a project is initiated in Refine, as described in the previous post, by selecting Create Project in the tool, Choosing the File to be worked on, accepting the parsing options and creating the project. The image below shows what the data initially looks like in this case:

Original Data Formatting from Orbit.com

This data comes from FamPat where all of the patent family members are collected into a single record, so in this case, each row represents a single family and the cells contain multiple values associated with the individual family members. Working with patent families, as a discrete unit can be very useful, but there are times when a user might want to delve into the individual family members and explore if there are differences between them. For instance, an analyst might want to know if there are discrete granted patents from a specific country within the family and account for them individually.

Each row, representing a single patent family, has a number associated with it that is found in the FAN column and can be used later if we ever want to regroup the individual family members after we split them out. In order to do that we need to make sure that the number associated with the initial row continues to be associated with each of the new rows we are going to create once we split the values.

The second column in this collection contains the individual patent family members separated by a vertical bar (|) and is labeled PN. In row 1. it can be seen that there are five individual patent documents contained in this family. It can also be seen that there are additional columns, such as the kind column that also contains multiple values and is also separated by a vertical bar. Any column or field that will be analyzed later should be separated before proceeding.

Splitting Multiple Values within a Cell

Select Edit Cells followed by Split multi-valued cells… from the drop down associated with the PN column to perform this operation. Refine will ask for the separator used with the default being a comma. In this case a vertical bar was used so the comma should be replaced with a |.

Replace comma with vertical bar as Multi-Value Separator

Notice that after this operation takes place the project has gone from 25 rows to 233 rows and that in the PN column there is a single entry for the patent number as opposed to five values as was seen before. In this example let’s also split the Kind, Date and XPN columns. The steps to be followed are the same as what was done with the PN column and should be repeated for all of the columns that need to be further analyzed once the cells have been processed.

Results of Split Multi-Value Operation

The split multi-value operations only creates new rows with data from the columns processed. This is okay if the user only intends to use the specified columns in a subsequent analysis but often data from the other columns need to be added to the newly created rows so all of the rows in the project have entries in all of their cells. This is accomplished by using the Fill Down operation for the columns of interest as seen in the image below:

Fill Down Operation Also Located Under Edit Cells Column Drop Down

At this point, the manipulation of the data is finished and the user can export the project so it can be worked on in Excel or another tool. In data collections with a lot of columns the process of splitting values and filling down can take a significant number of steps. Fortunately, Refine can generate code based on the operations performed and if you find that you need to repeat the same series of steps using multiple data collections the tool provides a means for going through the steps once and then saving the programing steps so they can be repeated by copying them from a saved text file and pasting them back into the tool. In order to do this the user needs to click on the Undo/Redo link just next to Facet/Filter link on the upper left hand side of the window. Selecting this link opens the operations window as seen in the image below:

List of Operations Performed During the Session

If any of the steps are selected the tool will undo that particular operation and bring the project to the state it was in prior to the running of that step. This can be very handy when you you’re experimenting with your data and find that a particular operation didn’t function in the way that you expected.

In the upper right corner of the Undo/Redo window, buttons can be found for Extract and Apply. In order to save the code associated with the seven operations performed on the data thus far push the Extract button which will produce the image below:

JSON Code Produced from Extract Function

This functions allows the analyst to determine what operations they want to include in the script and automatically selects the code so it can be copied and saved into a text editor with a simple paste command. Saving the text file allows the user to retain these operations and reapply them by opening the file at a later date and copying the code.

Once a new project is opened, select the Undo/Redo link again and this time press the Apply button. This will open a window where the copied code can be pasted into the window and the operations performed again as they were previously. Press the Perform Operations button once the code has been pasted in.

Paste the JSON Code Here to Perform the Desired Operations Again

This method of saving JSON code for a series of operations works with all aspects of Refine including the clustering and manual grouping steps used in the previous post on Patent Assignee cleanup. In this fashion, a complicated series of steps or the collective knowledge of manually cleaning assignee names can be saved and re-applied to new data sets as needed.

Having to save the code to a text file to reuse it as opposed to being able to create a code library within the tool is a little inefficient but having the ability to reuse the code in the first place more than makes up for the lack of this feature.

Regardless of the data source, Refine is an important tool for being able to manipulate patent content and provides the analyst with functionality without having to be able to write code.

Comments 3

  1. Hi Tony,
    You are certainly doing great service to all of us patent information people by posting this series! The logical next step is to use Open Refine on open data, and here my favorite is Open Patent Services (OPS) from EPO. Refine will load the XML files that OPS returns, but the results are not pretty. However, I just wrote a quick and dirty converter and had some fun experimenting in Refine with the OPS data downloaded in the TSV format.

    For example, you can retrieve your own patent publications by opening the URL:

    http://ops.epo.org/3.0/rest-services/published-data/search/biblio/?q=in%3D%22trippe%2C%20anthony%22

    The returned XML file has a style sheet associated with it so it looks good in the browser, but not so good if you save it and load in Refine. On the other hand, the text file in the following link contains the same information (minus classifications and references cited, at this point), but behaves nicely in Refine. Check it out:

    http://genemob.com/patent/myTSV.txt

    I am going to try next if I could do some clustering of patents based on the title/abstract contents. Do you have any experience with that?

    Thanks again for these great tutorials and Happy Easter,
    Piotr

    1. Hello Piotr,

      This is fantastic! Thank you so much for sharing and you are certainly welcome with regards to the tutorials. I will certainly look more closely at what you have done and write back to you with some additional comments.

      Clustering seems to be a bit of a misnomer in refine since it doesn’t seem to provide document level clustering. The cluster function I wrote about is for grouping of synonymous terms as opposed to the traditional document clustering.

      Thanks again for the comment and I look forward to investigating this further.

      Best regards,
      Tony

      1. Thanks, Tony. Actually I could do a sort of “poor man’s supervised clustering” of documents with this amazing tool. In the example above, I can do Facet/Customized facets/Word facet on the title column, and then click the word “system” to display your software work, or “PROTEASE” and “protease” to get your HIV work. Add a new column, put the categories there, and I’m ready for visualization. I hope that by learning some GREL I could do something more meaningful, like download all Sanofi US applications from last year and divide them into medical devices, insulin, small molecule, antibody-related publications, etc…

Leave a Reply

Your email address will not be published. Required fields are marked *