Name matching for PATSTAT data
Gianluca Tarasconi
KITeS Database Administrator
Website: rawpatentdata.blogspot.com
KITeS: Knowledge, Internationalization and Technology Studies
KITeS’s mission is to understand the relationship between innovation, technology management, firms’ competitiveness and economic growth in the global economy. KITeS’s research intends to be rigorous, relevant and inter-disciplinary. It focuses on three main areas: innovation, technology management and trade.
KITeS – The centre
KITeS was founded in 2008, building upon the experience of research centres such as CESPRI and CRITOM. It is hosted by Bocconi University.
KITeS is an inter-departmental research centre, integrating researchers from the Economics, Management and Institutional Analysis departments. KITeS researchers hold doctoral degrees from Yale, Stanford, London School of Economics, Bocconi, Manchester, Leuven, Sussex, Maastricht, and others.
Patent statistics have been widely used at KITeS for many years now, dating back to CESPRI's early research in industrial dynamics.
This tradition has led to the cumulative creation and updating of a large database, known as EP-CESPRI. The inventor data used so far are organized in a sub-section of this database, known as EP-INV.
… who’s who: www.kites.unibocconi.it
The EP-CESPRI Database (i)
The EP-CESPRI database contains information on patents applied for at the European Patent Office (EPO) from 1978 to October 2009.
The EP-CESPRI database was first created from information downloaded regularly from EPO Bulletins. Since October 2007 it has been based on the applications that EPO publishes on a regular basis in PATSTAT; at present it contains about 2.090.000 patent applications.
A beta version for USPTO data was released in 2009, and a version for SIPO (the Chinese patent office) is planned for 2010.
The EP-CESPRI Database (ii)
EP-CESPRI data fall into three broad categories:
1. Patent data, such as the patent’s publication number, its priority/application date, and its main/secondary technological class (IPC, 12-digit).
2. Applicant data, such as the applicant’s name and address, plus a unique code assigned by KITeS to each applicant after cleaning the applicant’s data.
3. Inventor data, such as name, surname, address and a unique code (CODINV) assigned by KITeS to all inventors found to be the same person. This section of EP-CESPRI is also known as EP-INV, and it is the one of major interest for today’s seminar.
EP-INV: From raw data to structured data
Data coming from PATSTAT are cleaned, standardized and re-structured → CODINV2 code
A similarity score is then calculated for pairs of inventors who have the same name and surname but different addresses → CODINV code
Standardization of inventors’ names and addresses
Original EPO data on inventors come from PATSTAT table TLS206_ASCII, where data are only partially parsed for names, address, city, zip codes.
Further steps are as follows:
1. Cleaning of address data (yields CODINV2 codes)
2. Cleaning of names (yields CODINV2 codes)
3. Computation of similarity scores (yields CODINV codes)
Cleaning of address data
Parsed data are given a unique code (CODINV2) and (iteratively) cleaned by:
• shifting information contained in wrong fields (like zip code, county…);
• standardizing city names or parts of names (e.g. “Saint” is turned into “St.”);
• fixing mistakes in zip codes, according to national post office tables.
In the 10/2007 data there were 2.381.991 CODINV2 codes in the EP-INV database, against 3.278.486 PATSTAT person_id values (about 27% fewer).
Example of city cleaning
Step          CITY                     ZIP
ORIGINAL      DDR-4203 Bad Dürrenberg
ZIP PARSED    Bad Dürrenberg           4203
CITY CLEANED  BAD DURRENBERG           4203
ZIP LOOKUP    BAD DURRENBERG           06231
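The steps of this example can be sketched in Python. This is only an illustration under stated assumptions: the function names, the regular expression for the country prefix, and the one-entry `ZIP_LOOKUP` dictionary (standing in for a national post office table) are hypothetical and are not taken from the actual MySQL code.

```python
import re
import unicodedata

# Hypothetical stand-in for a national post office table:
# (cleaned city, old zip) -> current zip.
ZIP_LOOKUP = {("BAD DURRENBERG", "4203"): "06231"}


def parse_zip(raw_city):
    """Move an optional country prefix and zip code out of the city field."""
    match = re.match(r"^(?:[A-Z]{1,3}-)?(\d{4,5})\s+(.+)$", raw_city.strip())
    if match:
        return match.group(2), match.group(1)
    return raw_city.strip(), ""


def clean_city(city):
    """Upper-case the city name and strip diacritics."""
    decomposed = unicodedata.normalize("NFKD", city)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.upper()


def lookup_zip(city, zip_code):
    """Replace the zip code when the lookup table knows a correction."""
    return ZIP_LOOKUP.get((city, zip_code), zip_code)


def clean_city_record(raw_city):
    """Run the three steps in sequence and return (city, zip)."""
    city, zip_code = parse_zip(raw_city)
    city = clean_city(city)
    return city, lookup_zip(city, zip_code)
```

Calling `clean_city_record("DDR-4203 Bad Dürrenberg")` walks through the ZIP PARSED, CITY CLEANED and ZIP LOOKUP rows in turn.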
Cleaning of names
The “name+surname” field was parsed into the following fields: first, second and third name, extension (e.g. Jr, Sr, III), surname, and academic title (e.g. Dr., Prof., Ing., …).
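A minimal sketch of such a parser is given here, assuming the raw field has the form “Surname, Given names” as in the examples of this deck. The token sets `TITLES` and `EXTENSIONS` and the function `parse_name` are hypothetical; the actual parsing relied on MySQL queries and lookup tables.

```python
# Hypothetical marker sets; real lookup tables would be far larger.
TITLES = {"DR", "PROF", "ING"}
EXTENSIONS = {"JR", "SR", "II", "III"}


def parse_name(raw):
    """Split 'Surname, Given names' into the six target fields."""
    fields = dict.fromkeys(
        ("surname", "first", "second", "third", "extension", "title"), ""
    )

    def strip_markers(tokens):
        # Pull titles and extensions out of the token list, keep the rest.
        kept = []
        for token in tokens:
            key = token.rstrip(".").upper()
            if key in TITLES:
                fields["title"] = token
            elif key in EXTENSIONS:
                fields["extension"] = token
            else:
                kept.append(token)
        return kept

    surname_part, _, given_part = raw.partition(",")
    fields["surname"] = " ".join(strip_markers(surname_part.split()))
    given = strip_markers(given_part.split())
    # Only the first three given names are kept; any further ones are dropped.
    for slot, token in zip(("first", "second", "third"), given):
        fields[slot] = token
    return fields
```

For instance, `parse_name("Smith Jr, Dr. John A.")` separates the extension and the title from the surname and the given names.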
Name cleaning was mainly based on two iterative steps:
1. Pairs of inventors with the same address and equal first name, surname, extension and initial of second or third name have that second or third name harmonized (e.g. “Rossi Giovanni Paolo” is turned into “Rossi Giovanni P.”).
2. For pairs of inventor records where 2 out of the 3 fields city, address and name are the same and the remaining one has a low edit distance (Levenshtein/alphanumeric), the record is updated with the data of the inventor with the higher number of patents.
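The edit-distance rule can be illustrated with a short Python sketch. The Levenshtein function is the textbook dynamic-programming version; the cut-off `max_distance=3`, the `n_patents` field and the function names are assumptions made for this example, not values from the actual routine.

```python
FIELDS = ("name", "address", "city")


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def is_near_duplicate(rec_a, rec_b, max_distance=3):
    """True when at most one of the three fields differs, and only slightly."""
    differing = [f for f in FIELDS if rec_a[f] != rec_b[f]]
    if len(differing) > 1:
        return False
    return all(levenshtein(rec_a[f], rec_b[f]) <= max_distance for f in differing)


def reconcile(rec_a, rec_b, max_distance=3):
    """Copy the fields of the inventor with more patents onto the other record."""
    if not is_near_duplicate(rec_a, rec_b, max_distance):
        return rec_a, rec_b
    a_is_master = rec_a["n_patents"] >= rec_b["n_patents"]
    master, other = (rec_a, rec_b) if a_is_master else (rec_b, rec_a)
    updated = dict(other, **{f: master[f] for f in FIELDS})
    return (rec_a, updated) if a_is_master else (updated, rec_b)
```

With “Via P. Maspero, 24” and “Via Maspero, 24” as the only differing field, the two records count as near duplicates and the address of the more prolific inventor wins.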
An example
Before cleaning:
Name                 Address                 City            Zip    codinv2
Tarasconi, Gianluca  Via P. Maspero, 24      Milan                  1
Tarasconi, Gianluca  Via Maspero, 24         IT-20137 Milan         2
Tarasconi, G.        c/o university bocconi  Milano          20136  3
Tarasconi, Gianluca  c/o university bocconi  Milano          20136  4
Tarasconi, Gianluca  35, Via Tertulliano     Milan                  5
After cleaning:
Name                 Address                 City            Zip    codinv2
Tarasconi, Gianluca  Via Maspero, 24         Milano          20137  1
Tarasconi, Gianluca  c/o university bocconi  Milano          20136  3
Tarasconi, Gianluca  Via Tertulliano, 35     Milano          20135  5
Further info on cleaning names and addresses
• Names and addresses were cleaned using MySQL;
• The SQL code is based on 25 lookup tables and 950 recursive queries;
• The aggregation algorithm was quite conservative (to allow ‘new entries’ to be quickly linked).
Computation of similarity score
• Inventor data are restructured as person (CODINV) vs person@location (CODINV2)
• All inventors who share name and surname but differ in anything else are compared in pairs by the Massacrator SQL routine
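How the candidate pairs might be generated can be sketched in Python: records are grouped by surname and name, and every pair of CODINV2 codes within a group becomes a candidate for scoring. The record layout and the function `candidate_pairs` are hypothetical; the Massacrator itself is an SQL routine.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records):
    """Return all pairs of codinv2 codes that share surname and name."""
    by_name = defaultdict(list)
    for record in records:
        by_name[(record["surname"], record["name"])].append(record["codinv2"])
    pairs = []
    for codes in by_name.values():
        # Every unordered pair inside a same-name group is a candidate.
        pairs.extend(combinations(sorted(codes), 2))
    return sorted(pairs)


records = [
    {"surname": "Tarasconi", "name": "Gianluca", "codinv2": 1},
    {"surname": "Tarasconi", "name": "Gianluca", "codinv2": 3},
    {"surname": "Tarasconi", "name": "Gianluca", "codinv2": 5},
    {"surname": "Rossi", "name": "Giovanni", "codinv2": 7},
]
print(candidate_pairs(records))
```

The number of pairs in a group grows quadratically with its size, so frequent surnames account for a large share of the comparisons.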
Introduction of CODINV
Name                 Address                 City    Zip    codinv2  Codinv
Tarasconi, Gianluca  Via Maspero, 24         Milano  20137  1        1
Tarasconi, Gianluca  c/o university bocconi  Milano  20136  3        2
Tarasconi, Gianluca  Via Tertulliano, 35     Milano  20135  5        3
Computation of similarity score
The similarity score combines six kinds of evidence:
• Workplace: same applicant/company/group
• Social networks: co-inventors in common, 3 degrees of distance in co-inventorship
• Toponymic permanence: same address, town, county…
• Citation linkages: (self-)citing or cited
• Time lag: how long since the last patent?
• IPC: patenting in the same tech fields
Scores by category

Workplace
  Same applicant                                          5
  Same applicant (the applicant has <50 inventors)        5
  Same group (if available)                               5
IPC
  Same IPC code (4 digits)                                5
  Same IPC code (6 digits)                                5
  Same IPC code (12 digits)                              10
Toponymic Permanence
  Same city                                               5
  Same province                                           5
  Same region                                             5
  Same state (US)                                         5
  Same address [in different cities; it may
  indicate misspellings in the city field]                5
Time Lag
  Priority dates differ for >20 years                    -5
Citation linkages
  Inventor 1 cites inventor 2                             5
  Inventor 1 is cited by inventor 2                       5
Social Networks
  Same coinventor                                        10
  3 degrees of separation                                10
Other
  Widespread surname                                     -5
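As an illustration, the weights of this slide can be put in a dictionary and summed over the criteria that hold for a pair. This sketch assumes the total score is a plain sum of the individual scores; the key names and the function `similarity_score` are hypothetical labels invented for the example.

```python
# Weights copied from the score table; the key names are hypothetical.
SCORES = {
    "same_applicant": 5,
    "same_small_applicant": 5,      # applicant has <50 inventors
    "same_group": 5,
    "same_ipc_4": 5,
    "same_ipc_6": 5,
    "same_ipc_12": 10,
    "same_city": 5,
    "same_province": 5,
    "same_region": 5,
    "same_state_us": 5,
    "same_address_other_city": 5,
    "priority_gap_over_20_years": -5,
    "cites": 5,
    "cited_by": 5,
    "same_coinventor": 10,
    "three_degrees": 10,
    "widespread_surname": -5,
}


def similarity_score(evidence):
    """Sum the weights of all criteria flagged as true for a pair."""
    unknown = set(evidence) - set(SCORES)
    if unknown:
        raise KeyError(f"unknown criteria: {sorted(unknown)}")
    return sum(SCORES[name] for name, holds in evidence.items() if holds)


pair = {
    "same_applicant": True,
    "same_ipc_4": True,
    "same_city": True,
    "widespread_surname": True,
    "same_coinventor": False,
}
print(similarity_score(pair))
```

Negative weights (long time lag, widespread surname) lower the total, so a pair with little positive evidence can end up below zero.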
Update of CODINV using similarity score
Name                 Address                 City    Zip    codinv2  Codinv  codinv (1st run)  codinv (2nd run)
Tarasconi, Gianluca  Via Maspero, 24         Milano  20137  1        1       1                 1
Tarasconi, Gianluca  c/o university bocconi  Milano  20136  3        2       1                 1
Tarasconi, Gianluca  Via Tertulliano, 35     Milano  20135  5        3       3                 1
The algorithm should be run recursively.
Intuitively, a high similarity score can be taken as an indication of a high probability that the two inventors in a pair are the same person. Whenever the two inventors in a pair are found to be the same person, the lowest CODINV code is assigned to both.
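The reassignment rule can be sketched in Python as a loop that repeats until no code changes. The cut-off is passed in by the caller, and the pair scores are kept fixed between passes, which may differ from what the actual SQL routine does; `merge_codinv` and the data layout are hypothetical.

```python
def merge_codinv(codinv, scored_pairs, threshold):
    """Give both members of every high-scoring pair the lowest CODINV.

    codinv:       dict mapping codinv2 -> current CODINV
    scored_pairs: iterable of (codinv2_a, codinv2_b, score)
    threshold:    minimum score for a pair to count as the same person
    """
    codinv = dict(codinv)           # work on a copy
    pairs = list(scored_pairs)
    changed = True
    while changed:                  # repeat until a full pass changes nothing
        changed = False
        for a, b, score in pairs:
            if score < threshold:
                continue
            lowest = min(codinv[a], codinv[b])
            for key in (a, b):
                if codinv[key] != lowest:
                    codinv[key] = lowest
                    changed = True
    return codinv


start = {1: 1, 3: 2, 5: 3}
pairs = [(3, 5, 20), (1, 3, 25), (1, 5, 0)]
print(merge_codinv(start, pairs, threshold=15))
```

In this toy run the pair (1, 5) never reaches the cut-off on its own, yet codinv2 5 still ends up with CODINV 1 because it is linked to 3, which is linked to 1. That chaining is why a single pass is not enough.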
Finding a threshold value (I)
Manual checking of EP-INV records suggests that a large number of paired inventors with a total score higher than 20 are indeed the same person.
Percentages vary across countries, largely because of the different distribution of frequent surnames. Therefore, no automatic re-assignment of CODINV codes has been performed so far.
In the KEINS research, data have been extensively checked for IT, FR and SE; the threshold value of the similarity score was set at 15 (the median value): inventors in pairs with a score >= 15 are then presumed to be the same person and assigned the same CODINV code.
Finding a threshold value (II)
Manual checking suggests that:
• no Type 2 errors (false positives) are introduced with this choice, i.e. no pair of inventors is erroneously assigned the same CODINV code;
• several Type 1 errors remain, i.e. pairs of inventors who are indeed the same person have scores <15 and are not given the same CODINV code.
Applying Massacrator to all EPO (I)
[Figure: distribution of score. x-axis: score (from -10 to 375); y-axis: number of couples (log scale, from 1 to 1000000)]
At 10/2007 we get 2.672.671 couples out of 2.363.501 inventors.
The mode is 0 pts (764.946 couples), but 758.471 couples have >= 15 pts.
Applying Massacrator to all EPO (II)
16,78% of couples have >= 20 pts; 22,72% of couples have >= 15 pts.
[Figure: percentage of couples (y-axis, from 0,00% to 100,00%) by score (x-axis, from -10 to 155)]
Applying Massacrator to all EPO (III)
A rough version of the algorithm, useful as a proxy for the possible reduction, treats two inventors as the same person when any of the following holds:
same IPC (12 digits) OR
same applicant OR
same address OR
3 degrees of distance OR
1 coinventor in common OR
citation linkage OR
same IPC (6 digits) and same country
This compresses 571970 CODINVs out of 2363501 (-24%).
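This rough rule is a plain boolean disjunction, which a few lines of Python can make explicit. The flag names and the functions `raw_match` and `reduction_share` are hypothetical; only the rule itself and the two counts come from the slide.

```python
# Criteria that are sufficient on their own (hypothetical flag names).
SINGLE_CRITERIA = (
    "same_ipc_12",
    "same_applicant",
    "same_address",
    "three_degrees",
    "common_coinventor",
    "citation_link",
)


def raw_match(pair):
    """True when any single criterion holds, or IPC-6 and country both match."""
    if any(pair.get(name, False) for name in SINGLE_CRITERIA):
        return True
    return bool(pair.get("same_ipc_6") and pair.get("same_country"))


def reduction_share(compressed, total):
    """Percentage of CODINV codes removed by the compression."""
    return 100.0 * compressed / total


print(raw_match({"same_ipc_6": True, "same_country": True}))
print(round(reduction_share(571970, 2363501)))
```

Unlike the weighted score, this rule needs no threshold: one matching criterion is enough, which makes it quick to run but presumably less conservative.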
Some publications using the EP-INV data
Lissoni F., Llerena P., McKelvey M., Sanditov B. (2008). Academic Patenting in Europe: New Evidence from the KEINS Database. Research Evaluation, 17(2), pp. 87-102.
Bacchiocchi E., Montobbio F. (2009). Knowledge Diffusion from University and Public Research. A Comparison between US, Japan and Europe using Patent Citations. Journal of Technology Transfer, 34(2), pp. 169-181.
Breschi S., Lissoni F., Montobbio F. (2008). University patenting and scientific productivity. A quantitative study of Italian academic inventors. European Management Review, 5(2), pp. 91-109.
Corrocher N., Malerba F., Montobbio F. (2007). Schumpeterian Patterns of Innovative Activity in the ICT Field. Research Policy, 36, pp. 418-432.
Breschi S., Lissoni F., Montobbio F. (2007). The Scientific Productivity of Academic Inventors: New Evidence from Italian Data. Economics of Innovation and New Technology, 16(2), pp. 101-118.
Della Malva A., Breschi S., Lissoni F., Montobbio F. (2007). L'attivita' brevettuale dei docenti universitari: L'Italia in un confronto internazionale. Economia e Politica Industriale, 2, pp. 43-70.
Montobbio F. (2008). Patenting Activity in Latin American and Caribbean Countries. In World Intellectual Property Organization (WIPO) - Economic Commission for Latin America and the Caribbean (ECLAC), Study on Intellectual Property Management in Open Economies: A Strategic Vision for Latin America. Forthcoming.
Future uses of the algorithm (I)
Cross patent-office match:
Is J. Smith in the EPO data the same person as J. Smith in the USPTO data?
Decompression:
Where toponymic data are scarce (USPTO data, for instance), mere data cleaning would group inventors who are not the same person; the algorithm could help to avoid type 2 errors.
Future uses of the algorithm (II)
Company match:
Identifying applicants with similar company names as the same entity;
NPL match:
Helping to deduplicate authors / affiliations.