Information about patents approved in the United States is publicly
available. The United States Patent and Trademark Office (USPTO)
provides digital bulk patent files on its website containing basic
details including patent titles, application and issue dates,
classification, and so on. Although files are available for patents
issued during or after 1976, patents from different periods are
accessible in different formats: patents issued between 1976 and 2001
(inclusive) are provided in TXT files; patents issued between 2002 and
2004 (inclusive) are provided in one XML format; and patents issued
during or after 2005 are provided in a second XML format. The patentr
R package accesses USPTO bulk data files and converts them to
rectangular CSV format so that users do not have to deal with distinct
formats and can work with patent data more easily.
CRAN hosts the stable version of patentr and GitHub hosts the
development version. Each of the lines of code below installs the
respective version.
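The installation commands might look like the following sketch. install.packages is the standard route for a CRAN package; the GitHub repository path JYProjs/patentr and the use of the devtools package are assumptions, so check the package's GitHub page before relying on them.

```r
# stable version from CRAN
install.packages("patentr")

# development version from GitHub
# (assumption: the repository lives at JYProjs/patentr and devtools is installed)
devtools::install_github("JYProjs/patentr")
```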
Acquiring patent data from the USPTO is straightforward with
patentr’s get_bulk_patent_data function.
First, we load patentr and the packages we’ll need for this vignette.
library(patentr)
library(tibble) # for the tibble data containers
library(magrittr) # for the pipe (%>%) operator
library(dplyr) # to work with patent data
library(lubridate) # to work with dates
Then, we use get_bulk_patent_data to acquire data from the first two
weeks of 1976. Since patentr stores the data as a local CSV file, we
must import the data into R. For this, we use the read.csv function.
# acquire data from USPTO
get_bulk_patent_data(
  year = rep(1976, 2),             # each week must have a corresponding year
  week = 1:2,                      # each week corresponds element-wise to a year
  output_file = "temp_output.csv"  # output file in which patent data is stored
)
# import data into R
patent_data <- read.csv("temp_output.csv") %>%
  as_tibble() %>%
  mutate(App_Date = as_date(App_Date),
         Issue_Date = as_date(Issue_Date))
# delete local file (optional - but we no longer need it for this tutorial)
file.remove("temp_output.csv")
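Because year and week are matched element-wise, a request that spans more than one year needs two vectors of equal length. The sketch below builds such vectors with base R for a hypothetical request (the last two weeks of 1976 and the first two of 1977). The requests data frame and the chosen week numbers are illustrative assumptions, and the download call is left commented out because it fetches data from the USPTO.

```r
# hypothetical request: one row per (year, week) pair we want to download
requests <- data.frame(
  year = c(1976, 1976, 1977, 1977),
  week = c(51, 52, 1, 2)
)

# both vectors must have the same length so that they pair up element-wise
stopifnot(length(requests$year) == length(requests$week))

# get_bulk_patent_data(
#   year = requests$year,
#   week = requests$week,
#   output_file = "temp_output.csv"
# )
```

Keeping the pairs in one data frame makes it harder for the two vectors to drift out of sync when weeks are added or removed.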
The patent_data variable should be equal to the y1976w1 dataset
provided with patentr. We peek at the patent data to get a glimpse of
its structure.
tail(patent_data)
#> # A tibble: 6 × 9
#> WKU Title App_Date Issue_Date Inventor Assignee ICL_Class References
#> <chr> <chr> <date> <date> <chr> <chr> <chr> <chr>
#> 1 039316408 Automa… 1974-03-22 1976-01-06 Ichiro … Sanyo E… G11B 170… 2946593;3…
#> 2 039316416 Transd… 1974-08-22 1976-01-06 Robert … Interna… G11B 54… 3310792;3…
#> 3 039316424 Magnet… 1973-06-15 1976-01-06 Koichi … Matsush… G11B 51… 2992474;3…
#> 4 039316432 Magnet… 1974-04-08 1976-01-06 Akio Ku… Matsush… G11B 54… 3069815;3…
#> 5 039316440 Jacket… 1975-03-03 1976-01-06 Paul F.… Informa… G11B 230… 3416150;3…
#> 6 039316459 Flexib… 1974-08-29 1976-01-06 Paul D.… Interna… G11B 58… 3852820
#> # ℹ 1 more variable: Claims <chr>
str(patent_data)
#> tibble [1,379 × 9] (S3: tbl_df/tbl/data.frame)
#> $ WKU : chr [1:1379] "RE0286710" "RE0286729" "RE0286737" "RE0286745" ...
#> $ Title : chr [1:1379] "Hydrophone damper assembly" "Pliable tape structure" "Method of preserving perishable products" "Catamenial device" ...
#> $ App_Date : Date[1:1379], format: "1974-08-26" "1975-02-06" ...
#> $ Issue_Date: Date[1:1379], format: "1976-01-06" "1976-01-06" ...
#> $ Inventor : chr [1:1379] "James W. Widenhofer" "Alfred W. Wakeman" "Joseph J. Esty" "Linda S. Guyette" ...
#> $ Assignee : chr [1:1379] "Sparton Corporation" "" "U. C. San Diego Foundation" "" ...
#> $ ICL_Class : chr [1:1379] "B63B 2152;B63B 5102" "E05D 700" "B65B 3104" "A61f 1320" ...
#> $ References: chr [1:1379] "2790186;3329015;3377615;3543228;3543228;3711821;3720909;3803540" "1843170;2611659;3279473;3442415;3851353" "2242686;2814382;3313084" "1222825;1401358;1887526;3085574" ...
#> $ Claims : chr [1:1379] "I claim:1. A hydrophone damper assembly comprising, in combination, an elongatedtube of flexible material havin"| __truncated__ "What is claimed is:1. A flexible tape for joining mating edges of adjacent members,said tape having an X-like c"| __truncated__ "Having described my invention, I now claim:1. Those steps in the method of preserving a perishable product in a"| __truncated__ "I claim:1. A rolled cylindrical tampon .Iadd.having means for conducting body fluidto the interior thereof, sai"| __truncated__ ...
For the recently acquired set of patents, let’s say we are interested
in how long it took for the patents to get issued once the application
was submitted. We can calculate the difference between the issue date
(Issue_Date column) and the application date (App_Date column), then
either obtain a numerical summary or visualize the results as a
histogram. The code block below does both.
# calculate time from application to issue (in days)
lag_time <- patent_data %>%
  transmute(Lag = Issue_Date - App_Date) %>%
  pull(Lag) %>%
  as.numeric()
# get quantitative summary
summary(lag_time)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 230.0 463.0 599.0 654.1 761.5 9331.0
# plot as histogram
hist(lag_time,
     main = "Histogram of delay before issue",
     xlab = "Time (days)", ylab = "Count")
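Subtracting two Date columns yields a difftime measured in days, which is why the summary above can be read in days. As a minimal base R sketch, the units can also be requested explicitly with difftime(). The two dates below are copied from the first row of the tail() output shown earlier, and the variable names are illustrative.

```r
# dates taken from the first row of the tail() output
app_date   <- as.Date("1974-03-22")
issue_date <- as.Date("1976-01-06")

# request the units explicitly instead of relying on the default
lag_days <- as.numeric(difftime(issue_date, app_date, units = "days"))
lag_days

# the same lag expressed in approximate years
lag_years <- lag_days / 365.25
lag_years
```

Stating the units in the call keeps the code correct if the date columns are later replaced by date-time values, where the default units can vary.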
In addition to application and issue dates, the downloaded USPTO data contains multiple text columns. More information about these can be found at https://www.uspto.gov/.
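Several of these text columns, such as References and ICL_Class, hold multiple values separated by semicolons, as the str() output shows. A minimal base R sketch for counting cited references per patent follows. The two-row example data frame is a hand-made stand-in using values copied from the printed output, not the result of a patentr call, and the N_References column name is illustrative.

```r
# stand-in for patent_data, with values copied from the printed output
example <- data.frame(
  WKU = c("RE0286729", "039316459"),
  References = c("1843170;2611659;3279473;3442415;3851353", "3852820"),
  stringsAsFactors = FALSE
)

# split each References string on ";" and count the pieces
ref_list <- strsplit(example$References, ";", fixed = TRUE)
example$N_References <- lengths(ref_list)
example
```

The same split-and-count approach should apply to ICL_Class, which uses the same separator.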
Text in boldface corresponds to column names in datasets returned by
get_bulk_patent_data. Note that the column definitions that follow are
intuitive, not official. For official definitions, visit
https://www.uspto.gov/.