In the previous post Google Refine (soon to be called Open Refine and henceforth referred to as simply Refine in the post) was used to clean up Patent Assignee data using the clustering algorithms built into the tool. It was surprising to hear that most readers hadn’t heard of Refine previously and certainly hadn’t applied it to the manipulation of patent data. There was even a question in the comments section of that post on whether the tool could be used to split multiple values in a cell into individual rows as opposed to columns. Continuing with the discussion on using Refine with patent data let’s look at a situation where it can be used to process the cells in several columns this way and how the operations can be coded so they can be repeated again with a different data set in the same format.
In this case a small collection of data for twenty-five older US granted patents from Nike was downloaded from Orbit.com. The data was exported in CSV (comma separated value) format but, as discussed previously, could have been downloaded in XML, XLS or a variety of other formats and still been usable in Refine. While Orbit.com was used in this particular example, most major patent information systems provide an export feature, and the methods described can be used anytime a user needs to separate multiple values in cells into individual rows.
Once the data has been exported in CSV format, a project is initiated in Refine, as described in the previous post, by selecting Create Project in the tool, Choosing the File to be worked on, accepting the parsing options and creating the project. The image below shows what the data initially looks like in this case:
This data comes from FamPat where all of the patent family members are collected into a single record, so in this case, each row represents a single family and the cells contain multiple values associated with the individual family members. Working with patent families, as a discrete unit can be very useful, but there are times when a user might want to delve into the individual family members and explore if there are differences between them. For instance, an analyst might want to know if there are discrete granted patents from a specific country within the family and account for them individually.
Each row, representing a single patent family, has a number associated with it that is found in the FAN column and can be used later if we ever want to regroup the individual family members after we split them out. In order to do that we need to make sure that the number associated with the initial row continues to be associated with each of the new rows we are going to create once we split the values.
The second column in this collection contains the individual patent family members separated by a vertical bar (|) and is labeled PN. In row 1. it can be seen that there are five individual patent documents contained in this family. It can also be seen that there are additional columns, such as the kind column that also contains multiple values and is also separated by a vertical bar. Any column or field that will be analyzed later should be separated before proceeding.
Select Edit Cells followed by Split multi-valued cells… from the drop down associated with the PN column to perform this operation. Refine will ask for the separator used with the default being a comma. In this case a vertical bar was used so the comma should be replaced with a |.
Notice that after this operation takes place the project has gone from 25 rows to 233 rows and that in the PN column there is a single entry for the patent number as opposed to five values as was seen before. In this example let’s also split the Kind, Date and XPN columns. The steps to be followed are the same as what was done with the PN column and should be repeated for all of the columns that need to be further analyzed once the cells have been processed.
The split multi-value operations only creates new rows with data from the columns processed. This is okay if the user only intends to use the specified columns in a subsequent analysis but often data from the other columns need to be added to the newly created rows so all of the rows in the project have entries in all of their cells. This is accomplished by using the Fill Down operation for the columns of interest as seen in the image below:
At this point, the manipulation of the data is finished and the user can export the project so it can be worked on in Excel or another tool. In data collections with a lot of columns the process of splitting values and filling down can take a significant number of steps. Fortunately, Refine can generate code based on the operations performed and if you find that you need to repeat the same series of steps using multiple data collections the tool provides a means for going through the steps once and then saving the programing steps so they can be repeated by copying them from a saved text file and pasting them back into the tool. In order to do this the user needs to click on the Undo/Redo link just next to Facet/Filter link on the upper left hand side of the window. Selecting this link opens the operations window as seen in the image below:
If any of the steps are selected the tool will undo that particular operation and bring the project to the state it was in prior to the running of that step. This can be very handy when you you’re experimenting with your data and find that a particular operation didn’t function in the way that you expected.
In the upper right corner of the Undo/Redo window, buttons can be found for Extract and Apply. In order to save the code associated with the seven operations performed on the data thus far push the Extract button which will produce the image below:
This functions allows the analyst to determine what operations they want to include in the script and automatically selects the code so it can be copied and saved into a text editor with a simple paste command. Saving the text file allows the user to retain these operations and reapply them by opening the file at a later date and copying the code.
Once a new project is opened, select the Undo/Redo link again and this time press the Apply button. This will open a window where the copied code can be pasted into the window and the operations performed again as they were previously. Press the Perform Operations button once the code has been pasted in.
This method of saving JSON code for a series of operations works with all aspects of Refine including the clustering and manual grouping steps used in the previous post on Patent Assignee cleanup. In this fashion, a complicated series of steps or the collective knowledge of manually cleaning assignee names can be saved and re-applied to new data sets as needed.
Having to save the code to a text file to reuse it as opposed to being able to create a code library within the tool is a little inefficient but having the ability to reuse the code in the first place more than makes up for the lack of this feature.
Regardless of the data source, Refine is an important tool for being able to manipulate patent content and provides the analyst with functionality without having to be able to write code.