Textomics: A Dataset for Genomics Data Summary Generation
Anonymous ACL submission
Abstract
Summarizing biomedical discoveries from genomics data in natural language is an essential step in biomedical research but is mostly done manually. Here, we introduce Textomics, a novel dataset of genomics data descriptions, which contains 22,273 pairs of a genomics data matrix and its summary. Each summary is written by the researchers who generated the data and is associated with a scientific paper. Based on this dataset, we study two novel tasks: generating a textual summary from a genomics data matrix and vice versa. Inspired by the successful application of k nearest neighbors to modeling genomics data, we propose a kNN-Vec2Text model to address these tasks and observe substantial improvements on our dataset. We further illustrate how Textomics can be used to advance other applications, including evaluating scientific paper embeddings and generating masked templates for scientific paper understanding. Textomics serves as the first benchmark for generating textual summaries of genomics data, and we envision it will be broadly applied to other biomedical and natural language processing applications.
1 Introduction
Modern genomics research has become increasingly automated and can be roughly divided into three sequential steps: next-generation sequencing technology produces a massive amount of genomics data; bioinformatics tools then process these data to identify key variants and genes; and, ultimately, biologists analyze the results to summarize the discoveries (Goodwin et al., 2016; Kanehisa and Bork, 2003). In contrast to the first two steps, which have been automated by new technologies and software, the last step of summarizing discoveries is still largely performed manually, substantially slowing down the progress of scientific discovery
(Hwang et al., 2018). A plausible solution is to automatically summarize the discoveries from genomics data using neural text generation, which has been successfully applied to radiology report generation (Wang et al., 2021; Yuan et al., 2019) and clinical note generation (Melamud and Shivade, 2019; Lee, 2018; Miura et al., 2021).
In this paper, we study the novel task of generating sentences that summarize a genomics data matrix. Several existing approaches have demonstrated encouraging results in generating short phrases to describe the functions of a set of genes (Wang et al., 2018; Zhang et al., 2020; Kramer et al., 2014). However, our task is fundamentally different from these: the input of our task is a matrix that contains tens of thousands of genes, which can be far noisier than a curated set of selected genes, and the output of our task is full sentences instead of short phrases or controlled vocabularies.
To study this task, we curate a novel dataset, Textomics, by integrating data from PMC, PubMed, and Gene Expression Omnibus (GEO) (Edgar et al., 2002) (Figure 1). GEO is the default database repository for researchers to upload their genomics data matrices, such as gene expression matrices and mutation matrices. Each genomics data matrix in GEO is a sample-by-feature matrix, where samples are often humans or mice sequenced together to study a specific biological problem, and features are genes or variants. Each matrix is also associated with a few sentences written by the researchers to summarize the data. After pre-processing, we obtain 22,273 matrix-summary pairs spanning 9 sequencing technology platforms. Each matrix has on average 2,475 samples and 22,796 features, and each summary has on average 46 words.
We further propose a novel approach to automatically generate a summary from a genomics data matrix, which is highly noisy and high-dimensional.
Figure 1: Flow chart of Textomics. a. Genomics data matrices and summaries are collected from GEO; scientific papers are collected from PMC and PubMed. Each data matrix is associated with a unique summary and a unique scientific paper in Textomics. b. Textomics is divided into 9 sequencing platforms spanning various species. Data matrices on the same platform share the same features and can therefore be used together to train a machine learning model. c. Textomics can be used as a benchmark for a variety of tasks, including Vec2Text, Text2Vec, measuring paper similarity, and scientific paper understanding. d. kNN-Vec2Text addresses the Vec2Text task by first constructing a reference summary list from similar genomics data matrices and then unifying these summaries to generate a new summary.
k-nearest neighbor (kNN) approaches have achieved great success on genomics data by capturing the hidden modules within it (Levine et al., 2015; Baran et al., 2019). The key idea of our method is to find the k nearest summaries according to genomics data similarity and then exploit an attention mechanism to convert these k nearest summaries into a new summary. Our method obtains substantial improvements over baseline approaches. We further illustrate how to generate a genomics data matrix from a given summary, offering the possibility of simulating genomics data from a textual description. We then introduce how Textomics can be used as a novel benchmark for measuring scientific paper similarity and evaluating scientific paper understanding. To the best of our knowledge, Textomics and kNN-Vec2Text together constitute the first large-scale benchmark for genomics data summary generation and can be broadly applied to a variety of natural language processing tasks.
2 Textomics Dataset
We collected genomics data matrices from Gene Expression Omnibus (GEO) (Edgar et al., 2002). Each feature of a data matrix is a gene or a variant, and each sample is an experimental subject, such as an experimental animal or a patient. Each data matrix is associated with an expert-written summary describing the matrix. We obtained in total 164,667 matrix-summary pairs, spanning 12,219 sequencing platforms. We truncated summaries longer than 64 words.
Data matrices belonging to the same sequencing platform share the same set of features and can thus be used together to train a model. To this end, for each platform we first selected the 20,000 features with the largest standard deviation and the lowest missing rate, and excluded samples with a substantially higher missing rate. We then selected the 9 platforms with the lowest rate of missing values and the largest number of matrix-summary pairs. We imputed the resulting data matrices using mean imputation and excluded outliers and non-informative summaries (e.g., "Please see our data below") through both manual inspection and an automated approach that excluded any summary substantially different from all other summaries based on pairwise BLEU scores. Finally, each of the 9 platforms contains 471 matrix-summary pairs on average, a desirable number of training samples for developing data summary generation models. We summarize the statistics of these 9 platforms in Supplementary Table S1.
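As a rough illustration, the automated BLEU-based filter described above could be implemented as follows; the tokenization, smoothing, and cutoff value are our own assumptions, not the authors' reported settings:

```python
# Hedged sketch of pairwise-BLEU outlier filtering: a summary is kept only
# if it is sufficiently similar to at least one other summary.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def filter_outlier_summaries(summaries, threshold=0.05):  # threshold is assumed
    smooth = SmoothingFunction().method1
    tokenized = [s.lower().split() for s in summaries]
    kept = []
    for i, hyp in enumerate(tokenized):
        others = [t for j, t in enumerate(tokenized) if j != i]
        best = max(sentence_bleu([ref], hyp, smoothing_function=smooth)
                   for ref in others)
        if best >= threshold:  # not "substantially different from all others"
            kept.append(summaries[i])
    return kept
```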
Data matrices belonging to the same platform have distinct samples (e.g., patient samples collected from two hospitals). To make them comparable and provide fixed-size features for machine learning models, we used a five-number summary to represent each data matrix. In particular, we calculated the smallest value, the first quartile, the median, the third quartile, and the largest value of each feature across samples in a specific data
matrix. We then concatenated these values across all features, resulting in a 100k-dimensional feature vector for each data matrix. This vector is used as the input to the machine learning model, and the original summary written by the authors is used as the output.
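A minimal sketch of this five-number-summary featurization, assuming a NumPy samples-by-features matrix and a feature-major concatenation order (the exact ordering is not specified in the text):

```python
import numpy as np

def five_number_vector(matrix: np.ndarray) -> np.ndarray:
    """Collapse a (samples x features) matrix into a fixed-size vector of the
    minimum, first quartile, median, third quartile, and maximum of each
    feature. With 20,000 features this yields a 100k-dimensional vector."""
    q = np.nanpercentile(matrix, [0, 25, 50, 75, 100], axis=0)  # (5, features)
    return q.T.reshape(-1)  # five consecutive values per feature (assumed order)
```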
Each data matrix is associated with a scientific paper, which describes how the authors generated and used the data. Therefore, the data matrix and the summary can be used to help embed these papers. We additionally retrieved these papers from the PubMed and PMC databases according to the paper titles enclosed in GEO, and obtained the full text for the 7,691 freely accessible ones. We introduce two applications that jointly use scientific papers and matrix-summary pairs in Section 6.
3 Task Description
We aim to accelerate genomics discovery by generating a textual summary given the five-number-summary-based vector of a genomics data matrix. We refer to this vector as the gene feature vector for simplicity. Specifically, consider a textual summary domain $\mathcal{D}$ and a gene feature vector domain $\mathcal{V}$. Let $D = \{D_\mathcal{D}, D_\mathcal{V}\} = \{(d_i, v_i)\}_{i=1}^{N} \overset{\text{dist}}{\sim} P(\mathcal{D}, \mathcal{V})$ be a dataset containing $N$ summary-vector pairs sampled from the joint distribution of these two domains, where $d_i \triangleq \langle d_i^{1}, d_i^{2}, \ldots, d_i^{n_{d_i}} \rangle$ denotes a token sequence and $v_i \in \mathbb{R}^{l_v}$ denotes the gene feature vector. Here $d_i^{j} \in C$, where $C$ is the vocabulary. We now formally de-
Figure 2: Density plot showing the Spearman correlation (0.45) between text-based similarity (y-axis) and vector-based similarity (x-axis) on sequencing platform GPL6246. Each dot is a pair of data samples.
fine two cross-domain generation tasks, Vec2Text and Text2Vec, based on our dataset. Given a gene feature vector $v_i$, Vec2Text aims to generate a summary $d_i$ that best describes $v_i$; given a textual summary $d_i$, Text2Vec aims to generate the gene feature vector $v_i$ that $d_i$ describes. Since we are studying a novel task on a novel dataset, we first examined the feasibility of this task. To this end, we obtained a dense representation of each textual summary using the pre-trained SPECTER model (Cohan et al., 2020) and used these representations to calculate a summary-based similarity between each pair of summaries. We also calculated a vector-based similarity between the gene feature vectors using cosine similarity. We found that these two similarity measurements show substantial agreement (Figure 2, Supplementary Table S2). All 9 platforms achieved a Spearman correlation greater than 0.2, suggesting the feasibility of generating a textual summary from the gene feature vector and vice versa.
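This feasibility check can be sketched as below; the SPECTER checkpoint name and [CLS] pooling follow the public release of the model, and both should be treated as assumptions rather than the authors' exact pipeline:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

def embed(texts):
    # Embed each text with SPECTER's [CLS] token representation.
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    return model(**batch).last_hidden_state[:, 0].detach().numpy()

def feasibility_correlation(summaries, gene_vectors):
    # Correlate summary-embedding similarity with gene-vector similarity
    # over all unique pairs of samples.
    s_sim = cosine_similarity(embed(summaries))
    v_sim = cosine_similarity(np.asarray(gene_vectors))
    iu = np.triu_indices_from(s_sim, k=1)
    return spearmanr(s_sim[iu], v_sim[iu]).correlation
```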
4 Methods
4.1 Vec2Text

We first introduce a base model that encodes gene feature vectors into a semantic embedding space and then decodes them to generate text. The base model contains a word embedding function $\mathrm{Emb}(\cdot)$, a gene feature vector encoder $\mathrm{Enc}_v(\cdot)$, and a decoder $\mathrm{Dec}_v(\cdot)$. Given a gene feature vector $v_i$, the encoder first embeds the data into a semantic representation space, $s_i^{(0)} = \mathrm{Enc}_v(v_i)$, and the decoder then starts from this representation for text generation. The generation process is autoregressive: it generates the $j$-th word $d_i^{(j)}$ and its embedding $s_i^{(j)}$ as

$$P(d_i^{(j)} \mid s_i^{(<j)}) = \mathrm{Dec}_v(s_i^{(<j)}), \quad j = 1, \ldots, n_{d_i}. \tag{1}$$

We then sample the next word and obtain its embedding as

$$s_i^{(j)} = \mathrm{Emb}(d_i^{(j)}), \quad d_i^{(j)} \overset{\text{sample}}{\sim} P(d_i^{(j)} \mid s_i^{(<j)}). \tag{2}$$

This model is trained using the following loss function:

$$\mathcal{L}_{\text{base}} = -\frac{1}{|D_\mathcal{V}|} \sum_{i=1}^{|D_\mathcal{V}|} \sum_{j=1}^{n_{d_i}} \log P(d_i^{(j)} \mid s_i^{(<j)}). \tag{3}$$
4.1.1 kNN-Vec2Text Model

The base model attempts to learn an encoder that projects a gene feature vector to a semantic representation. However, the substantial noise and high dimensionality of the gene feature vector pose
great challenges for effectively learning this projection. k-nearest neighbors models have been extensively used to overcome such issues in genomics data analysis (Levine et al., 2015; Baran et al., 2019). One plausible solution is therefore to explicitly leverage summaries from similar gene feature vectors to improve generation. Inspired by the encouraging performance of k-nearest neighbors (kNN) in seq2seq models (Khandelwal et al., 2019, 2021) and genomics data analysis (Levine et al., 2015; Baran et al., 2019), we propose to convert the Vec2Text problem into a Text2Text problem according to the k nearest neighbors of each vector.
For a given gene feature vector $g$, we use $e_i$ to denote its Euclidean distance to another gene feature vector $v_i$ in $D$. We then select the summaries of the $k$ samples with the minimum Euclidean distances as the reference summary list $t = [d_{j_1}, \ldots, d_{j_k}]$, where $j_m \in \{1, 2, \ldots, |D|\}$ denotes the index of the summaries ordered by Euclidean distance, i.e., $e_{j_1} \leq e_{j_2} \leq \cdots \leq e_{j_{|D|}}$.
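This retrieval step admits a direct implementation; a minimal sketch, assuming the training vectors fit in memory:

```python
import numpy as np

def k_nearest_summaries(g, train_vectors, train_summaries, k=4):
    """Return the summaries of the k training vectors closest to g in
    Euclidean distance (nearest first), together with the distances,
    which are reused later when scoring the references."""
    dists = np.linalg.norm(np.asarray(train_vectors) - np.asarray(g), axis=1)
    order = np.argsort(dists)[:k]
    return [train_summaries[j] for j in order], dists[order]
```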
In addition to alleviating the noise in genomics data using the reference summary list (Levine et al., 2015; Baran et al., 2019), our method explicitly converts the Vec2Text problem into a Text2Text problem and can thus seamlessly incorporate advanced pre-trained language models into our framework. The resulting problem is a k-sources-to-one-target generation problem. One naive solution is to concatenate the k reference summaries together. However, this concatenation makes the source text much longer than the target text, and how to order the summaries during concatenation remains unclear. Instead, we propose to transform this problem into k one-to-one generation problems and then use an attention-based strategy to fuse them. Concretely, let $n_j = \max\{n_{j_1}, \ldots, n_{j_k}\}$ be the maximum length among all the reference summaries. We first obtain the representation of each summary, $x_{j_m} = \mathrm{Emb}(d_{j_m}) = \langle x_{j_m}^{(1)}, \ldots, x_{j_m}^{(n_j)} \rangle$ for $m = 1, \ldots, k$, constructing fixed-length reference summaries by padding after the end of each summary shorter than $n_j$. We then utilize a self-attention module (SA) (Vaswani et al., 2017) to obtain the aggregated embedding of each reference from its embeddings as well as the gene feature vector distances $e_i$. Let $Q_r, K_r, V_r$ be the query, key, and value matrices of an embedding sequence $r = \langle r^{(1)}, \ldots, r^{(l_r)} \rangle$; we have

$$\mathrm{SA}(r) = \mathrm{Attention}(Q_r, K_r, V_r). \tag{4}$$

We then calculate the attention scores as follows:

$$a_{j_m} = \mathrm{SA}(\langle x_{j_m}^{(1)}, \ldots, x_{j_m}^{(n_j)} \rangle), \tag{5}$$

$$sc_j = \mathrm{SA}(\langle e_{j_1} \cdot a_{j_1}, \ldots, e_{j_k} \cdot a_{j_k} \rangle), \tag{6}$$

where $sc_j = [sc_{j_1}, \ldots, sc_{j_k}] \in \mathbb{R}^k$. The final score is then calculated from the attention scores and a temperature $\tau$ as

$$w_{j_m} = \frac{\exp(\tau \cdot sc_{j_m})}{\sum_{l=1}^{k} \exp(\tau \cdot sc_{j_l})}. \tag{7}$$

Then, we aggregate the embedding sequences by taking a weighted average:

$$x_j^{(l)} = \sum_{m=1}^{k} w_{j_m} x_{j_m}^{(l)}, \quad l = 1, \ldots, n_j. \tag{8}$$
Let $P_{<l,x}(d) = P_{\theta_{\mathrm{LM}}}(d^{(l)} \mid d^{(<l)}, x)$, $0 < l < n_d$, be the probability distribution of $d^{(l)}$ output by the language model $\theta_{\mathrm{LM}}$ conditioned on the sequence of embedding vectors $x$ and the first $l-1$ tokens of the sequence. We feed the aggregated embedding sequences into the language model to reconstruct the summary $d$ using an autoregressive loss function:

$$\mathcal{L}_{\text{kNN-Vec2Text}} = -\frac{1}{|D_\mathcal{D}|} \sum_{d \in D_\mathcal{D}} \sum_{l=1}^{n_d} \log P_{<l, x_j}(d). \tag{9}$$
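The fusion step of Eqs. (5)-(8) can be sketched in PyTorch as below. This is a simplified illustration: the second self-attention of Eq. (6) is replaced here by a linear scoring head over mean-pooled token states, and the hidden size and head count are assumptions:

```python
import torch
import torch.nn as nn

class ReferenceFusion(nn.Module):
    def __init__(self, dim=512, tau=0.1):  # dim must be divisible by num_heads
        super().__init__()
        self.sa = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.score = nn.Linear(dim, 1)  # simplification of the second SA
        self.tau = tau

    def forward(self, ref_emb, dists):
        # ref_emb: (k, n, dim) padded reference embeddings; dists: (k,)
        a, _ = self.sa(ref_emb, ref_emb, ref_emb)        # Eq. (5)
        sc = self.score(a.mean(dim=1)).squeeze(-1)       # pooled scores
        sc = sc * dists                                  # distance reweighting, Eq. (6)
        w = torch.softmax(self.tau * sc, dim=0)          # Eq. (7)
        return (w[:, None, None] * ref_emb).sum(dim=0)   # Eq. (8): (n, dim)
```

The fused sequence is then fed to the language model (T5 in our experiments) as the conditioning embeddings for Eq. (9).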
4.2 Text2Vec

We model the reverse problem of generating the gene feature vector $v$ from a textual summary $d$ as a regression problem. Our model is composed of a semantic encoder $\mathrm{Enc}_d(\cdot)$ and a readout head $\mathrm{MLP}(\cdot)$. Specifically, the encoder embeds the textual summary into a dense representation $x = \mathrm{Enc}_d(d)$, and the readout head maps the representation to the predicted gene feature vector $\hat{v} = \mathrm{MLP}(x)$. We train this model by minimizing the root mean squared error:

$$\mathcal{L}_v = \sqrt{\frac{1}{|D_\mathcal{D}|} \sum_{v_i \in D_\mathcal{V}} \frac{1}{l_d} \sum_{j=1}^{l_d} \left(v_i^{(j)} - \hat{v}_i^{(j)}\right)^2}. \tag{10}$$
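A minimal sketch of the Text2Vec regressor, assuming a SPECTER-style encoder with [CLS] pooling and an assumed hidden size; `out_dim` would be the 100k-dimensional gene feature vector:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class Text2Vec(nn.Module):
    def __init__(self, out_dim, enc_name="allenai/specter", hidden=512):
        super().__init__()
        self.enc = AutoModel.from_pretrained(enc_name)   # Enc_d
        d = self.enc.config.hidden_size
        self.head = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                  nn.Linear(hidden, out_dim))  # MLP readout

    def forward(self, **batch):                          # tokenized summary
        x = self.enc(**batch).last_hidden_state[:, 0]    # x = Enc_d(d)
        return self.head(x)                              # v_hat = MLP(x)

def rmse_loss(pred, target):
    # Root mean squared error, matching the form of Eq. (10).
    return torch.sqrt(torch.mean((pred - target) ** 2))
```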
5 Results
5.1 Vec2Text

To evaluate the performance of kNN-Vec2Text on the Vec2Text task, we compared it to the base model implemented with a Transformer (Vaswani et al., 2017) and GPT-2 (Radford et al., 2019), as well as Sent-VAE (Bowman et al., 2016). For kNN-Vec2Text, we set k = 4 and τ = 0.1, and used T5
Figure 3: Performance on Vec2Text and Text2Vec using Textomics as the benchmark. a. Bar plot comparing our method kNN-Vec2Text with existing approaches (GPT-2, Sent-VAE, Transformer) on the task of Vec2Text across 9 platforms in Textomics (BLEU-1 score). b. Bar plot comparing the improvement of different scientific paper embedding methods (SPECTER, SciBERT, SentBERT, BioBERT, BERT) over the averaging baseline across 9 platforms in Textomics.
Table 1: A case study of text generated by kNN-Vec2Text. Summaries of the four nearest neighbors in the input space are shown. The generated text is composed of short spans from four different neighbors (colored in red).

Neighbor 1: Analysis of B16 tumor microenvironments at gene expression level. The hypothesis tested in the present study was that Tregs orchastrated the immune reponse triggered in presence of tumors.
Neighbor 2: This study aims to look at gene expresion profiles between wildtype and Bapx1 knockout cells of the gut in a E12.5 mouse embryo.
Neighbor 3: The role of bone morphogenetic protein 2 in regulating transformation of the uterine stroma during embryo implantation in mice was investigated by the conditional ablation of Bmp2 in the uterus using the mouse.
Neighbor 4: Measurement of specific gene expression in clinical samples is a promising approach for monitoring the recipient immune status to the graft in organ transplantation.
Generated: Analysis of uterine microenvironment at gene expression level. The hypothesis tested in the present study was that Tregs orchestrated the immune reponse triggered in presence of embryo.
Truth: Analysis of uterine microenvironment at gene expression level. The hypothesis tested in the present study was that Tregs orchestrated the immune reponse triggered in presence of embryo.
(Raffel et al., 2020) as the language model. For all 9 platforms, we report the average performance under 5-fold cross-validation. The BLEU-1 scores are summarized in Figure 3a. We found that kNN-Vec2Text substantially outperformed the other methods: it obtained a 0.206 BLEU-1 score on average, while none of the other three methods achieved an average BLEU-1 score greater than 0.150. The prominent performance of our method demonstrates the effectiveness of using a k-nearest-neighbor approach to convert the Vec2Text problem into a Text2Text problem.
To further understand the superior performance of the kNN-Vec2Text model, we present a case study in Table 1. In this case study, the generated summary is highly accurate compared to the ground-truth summary. By examining the summaries of the 4 nearest neighbors in the gene feature vector space, we found that the generated summary is composed of short spans from each individual neighbor, again indicating the advantage of using k nearest neighbors for this task. Our method leverages an attention mechanism to unify these four neighbors, thus offering an accurate generation. We also observed consistent improvements of our method over the comparison approaches on other metrics, summarized in Supplementary Table S3.
5.2 Text2Vec

We next used the Text2Vec task to illustrate how our dataset can be used to compare the performance of different pre-trained language models. In particular, we compared a recently proposed scientific paper embedding method, SPECTER (Cohan et al., 2020), which has demonstrated prominent performance in a variety of scientific paper analysis tasks, with SciBERT (Beltagy et al., 2019), BioBERT (Lee et al., 2020), SentBERT (Wang and Kuo, 2020), and the vanilla BERT (Devlin et al., 2019). While the other language models directly take the token sequence as input, SPECTER needs to take both the abstract and the title. To make a fair comparison, we concatenated the title and the summary as the input for models other than SPECTER. For all 9 platforms, we
reported the average performance under 5-fold cross-validation. We further implemented a simple averaging baseline that predicts the vector for a test summary as the average of the training vectors. This baseline does not utilize any textual summary and thus helps us assess the effect of using textual summary information in this task. We used RMSE to evaluate all methods and report the RMSE improvement of each method over the averaging baseline in Figure 3b. We found that all methods outperform the baseline by at least 15%, indicating the importance of considering the textual summary in this task. SPECTER achieved the best overall performance among the five methods, suggesting the advantage of separately modeling the title and the abstract when embedding scientific papers.
6 Applications
6.1 Evaluating paper embeddings via Textomics
Embedding scientific papers is crucial for effectively identifying emerging research topics and new knowledge in the scientific literature. To this end, many machine learning models have been proposed to embed scientific papers into dense embeddings, which are then applied to a variety of downstream applications (Cohan et al., 2020; Lee et al., 2020; Wang and Kuo, 2020; Beltagy et al., 2019; Devlin et al., 2019). However, there is currently no widely accepted gold standard for measuring the similarity between two papers. As a result, existing approaches use surrogate metrics such as citation relationships, keywords, and user activities to evaluate their paper embeddings (Cohan et al., 2020; Chen et al., 2019; Wang et al., 2019).
Textomics can be used to evaluate these paper embedding approaches by examining the consistency between embedding-based paper similarity and embedding-based summary similarity, since both the paper and the summary are written by the same authors. In particular, for a pair of summaries $d_i, d_j \in D_\mathcal{D}$, let $t_i, t_j$ be the text (e.g., abstracts) extracted from their corresponding scientific papers, and let $\mathrm{Enc}_d$ be the encoder of the paper embedding method we want to evaluate. We first obtain the embeddings

$$s_{d_i}, s_{d_j} = \mathrm{Enc}_d(d_i), \mathrm{Enc}_d(d_j) \in \mathbb{R}^{l_s}, \tag{11}$$

$$s_{t_i}, s_{t_j} = \mathrm{Enc}_d(t_i), \mathrm{Enc}_d(t_j) \in \mathbb{R}^{l_s}. \tag{12}$$

We then compute the pairwise Euclidean distances between all pairs of summaries and all pairs of paper text as

$$s_{d_{i,j}} = \sqrt{\sum_{k=1}^{l_s} \left(s_{d_i}^{(k)} - s_{d_j}^{(k)}\right)^2} \in \mathbb{R}, \tag{13}$$

$$s_{t_{i,j}} = \sqrt{\sum_{k=1}^{l_s} \left(s_{t_i}^{(k)} - s_{t_j}^{(k)}\right)^2} \in \mathbb{R}. \tag{14}$$
To evaluate the quality of the encoder $\mathrm{Enc}_d$, we calculate the Spearman correlation between the pairwise summary similarities and the pairwise text similarities; a larger Spearman correlation indicates that $\mathrm{Enc}_d$ is more accurate at embedding scientific papers. As a proof of concept, we obtained the full text of 7,691 papers in our dataset from the freely accessible PubMed Central and segmented each paper into five sections: abstract, introduction, method, result, and conclusion. We first compared different paper embedding methods using the abstract of each paper; the five embedding methods we considered are introduced in Section 5.2. Since SPECTER takes both the title and a paragraph as input, we used the first sentence of the summary as a pseudo-title when encoding the summary. The results are summarized in Figure 4a. We found that SPECTER was substantially better than the other methods on 8 out of the 9 platforms. SPECTER is specifically developed to embed scientific papers by processing the title and the abstract separately, whereas the other pre-trained language models simply concatenate the title and the abstract; the superior performance of SPECTER suggests the importance of separately modeling the paper title and abstract when embedding scientific papers. SentBERT obtained the best performance among the four pre-trained language models, partially due to its prominent performance in sentence-level embedding. We further noticed that the relative performance of the different methods is largely consistent with previous work evaluated on other metrics (Cohan et al., 2020), demonstrating the high quality of Textomics.
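Sketched concretely, the benchmark protocol of Eqs. (11)-(14) reduces to a few lines, where `encode` is any candidate embedding function mapping a list of texts to an (n, l_s) array:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def embedding_consistency(encode, summaries, abstracts):
    """Score an encoder by how well pairwise distances between paper texts
    agree with pairwise distances between the matched summaries; a higher
    Spearman correlation indicates a better paper embedding."""
    d_sum = pdist(encode(summaries), metric="euclidean")   # Eq. (13)
    d_abs = pdist(encode(abstracts), metric="euclidean")   # Eq. (14)
    return spearmanr(d_sum, d_abs).correlation
```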
Having observed the superior performance of SPECTER, we next investigated which section of a paper best reflects paper similarity. Although existing paper embedding approaches often leverage the abstract for embedding, other sections, such as the introduction and results, might also be informative, especially for papers describing a specific dataset or method. We thus applied SPECTER to embed five different sections of each scientific
Figure 4: Performance of using Textomics as the benchmark to evaluate scientific paper embeddings. a. Bar plot showing the comparison of methods for embedding scientific papers (SPECTER, SentBERT, SciBERT, BioBERT, BERT) using Textomics as the benchmark. b. Bar plot showing the comparison of SPECTER embeddings of different paper sections (abstract, introduction, method, result, conclusion) using Textomics as the benchmark.
paper and used Textomics to evaluate which section best reflects paper similarity. We observed a consistent improvement from using the abstract section in comparison to the other paper sections (Figure 4b), which is consistent with the intuition that the abstract is a good summary of the scientific paper, again indicating the reliability of using Textomics to evaluate paper embedding methods.
6.2 Scientific paper understanding
Creating masked sentences and then filling in the masks can examine whether a machine learning model has properly understood a scientific paper. However, one challenge in such research is how to generate masked sentences that are relevant to a given paper while also ensuring the answer is contained in the paper. Our dataset can be used to automatically generate such masked sentences from the summary, which is highly relevant to the paper but does not overlap with it. In particular, we can mask out keywords from the summary and then use this masked summary as the question, letting a machine learning model find the answer in the non-overlapping scientific paper. Let $C_{\text{bio}}$ be a dictionary of biological keywords we want to mask out from the summary, and let $(d_i, t_i)$ be a pair of a textual summary and the paragraph text extracted from its corresponding scientific paper. If the $j$-th word $w_i = d_i^{(j)}$ of the summary belongs to $C_{\text{bio}}$, our proposed task is to predict which word in $C_{\text{bio}}$ is the missing word in $d_{\text{masked}}$ given $t_i$. The masked summary $d_{\text{masked}}$ is the same as $d_i$ except that its $j$-th word is substituted with [PAD]. For simplicity, we mask at most one token in $d_i$. We therefore formulate our task as a multi-class classification problem. Sim-
Figure 5: Bar plot showing the accuracy of filling masked sentences from ten biomedical categories (Covoc, Eupath, Obi_iee, Obi, Planp, Ecocore, Xpo, Argo, Trak, Premedonto) across 9 platforms, using Textomics as the benchmark.
ilar to Section 6.1, we used the paper abstract as the paragraph text $t_i$. To generate $C_{\text{bio}}$, we leveraged a recently developed biological terminology dataset, Graphine (Liu et al., 2021), which provides biological phrases spanning 227 categories. We selected the 10 categories that produce the largest number of masked sentences in Textomics and manually filtered ambiguous words and stop words. On average, each category contains 317 keywords. We used a fully connected neural network to perform the multi-class classification task. The input feature is the concatenation of the masked summary embedding and the paragraph embedding; we used SPECTER to derive these embeddings as it obtained the best performance in our previous analysis. The results are summarized in Figure 5. We observed high accuracy on all ten categories, much better than the 0.4% accuracy of random guessing, indicating the usefulness of our benchmark for scientific paper understanding. Finally, we found that the performance of each category varied
across different platforms, suggesting the possibility of further improving performance by jointly learning from all platforms.
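A hedged sketch of this masked-keyword pipeline, where the embedding function, mask placement, and classifier sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

def mask_summary(summary, keyword_to_label):
    """Replace the first keyword found in the vocabulary with [PAD];
    return the masked text and the keyword's class label (or None)."""
    tokens = summary.split()
    for i, t in enumerate(tokens):
        if t.lower() in keyword_to_label:
            masked = " ".join(tokens[:i] + ["[PAD]"] + tokens[i + 1:])
            return masked, keyword_to_label[t.lower()]
    return None, None

class KeywordClassifier(nn.Module):
    """Fully connected network over the concatenated embeddings of the
    masked summary and the paper abstract (SPECTER in the paper)."""
    def __init__(self, emb_dim, n_keywords, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_keywords))

    def forward(self, summary_emb, abstract_emb):
        return self.net(torch.cat([summary_emb, abstract_emb], dim=-1))
```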
7 Related work
Our task is related to existing work that takes structured data as input and generates unstructured text. Different input data modalities and related datasets have been considered in the literature, including text triplets in RDF graphs (Gardent et al., 2017; Ribeiro et al., 2020; Song et al., 2021; Chen et al., 2020), data tables (Lebret et al., 2016; Rebuffel et al., 2021; Dusek et al., 2019; Rebuffel et al., 2019; Puduppully and Lapata, 2021; Chen et al., 2020), electronic medical records (Lee, 2018; Guan et al., 2018), radiology reports (Wang et al., 2021; Yuan et al., 2019; Miura et al., 2021), and other continuous data modalities without explicit textual structure such as images (Lin et al., 2015; Cornia et al., 2020; Ke et al., 2019; Radford et al., 2021), audio (Drossos et al., 2019; Manco et al., 2021; Wu et al., 2021; Mei et al., 2021), and video (Li et al., 2021; Ging et al., 2020; Zhou et al., 2018; Li et al., 2020). Different from these structures, our dataset takes a high-dimensional genomics feature matrix as input, which exhibits no explicit structure and is thus substantially different from the other modalities. Moreover, ours is the first dataset that aims to convert genomics feature vectors to textual summaries. The substantial noise and high dimensionality of the genomics data matrix pose further unique challenges for text generation.
Our kNN-Vec2Text model is inspired by the recent success of kNN-based models in machine translation (Khandelwal et al., 2021) and language modeling (Khandelwal et al., 2019; He et al., 2021; Ton et al., 2021). The main difference between our method and these approaches is that we leverage kNN in the genomics vector space to construct reference texts, whereas they use kNN in the text embedding space during the autoregressive generation process to adjust the sampling distribution. Other methods can be used to generate text from vectors (Bowman et al., 2016; Song et al., 2019; Miao and Blunsom, 2016; Montero et al., 2021; Zhang et al., 2019), but their inputs are latent vectors that must be inferred from the data and have no specific meaning, which differs from our gene feature vectors.
8 Conclusion and future work
In this paper, we have proposed a novel dataset, Textomics, containing 22,273 pairs of a genomics matrix and its corresponding textual summary. We then introduced the novel task of Vec2Text based on our dataset, which aims to generate a textual summary from the gene feature vector. To address this task, we proposed a novel method, kNN-Vec2Text, which constructs reference texts using nearest neighbors in the gene feature vector space and then generates a new summary from these references. We further introduced two applications that can be advanced using our dataset: one evaluates scientific paper similarity according to the similarity of the corresponding data summaries, and the other leverages our dataset to automatically generate masked sentences for scientific paper understanding.
Our method searches for nearest neighbors by calculating the Euclidean distance between five-number summary vectors of the genomics feature matrices. However, this may lose useful information contained in the original matrices. It is worth exploring end-to-end approaches that learn embeddings directly from the genomics feature matrix instead of representing it as a five-number summary vector. On the Text2Vec side, we are interested in extending our work to directly generate the whole genomics feature matrix instead of the five-number summary vector. It would also be interesting to jointly learn the Text2Vec and Vec2Text tasks; one potential solution is to further decode the generated vector to reconstruct the embedding of the summary in Text2Vec, and to leverage the resulting decoder to predict text embeddings using a kNN method in the text embedding space.
To the best of our knowledge, Textomics and kNN-Vec2Text together serve as the first large-scale genomics data description benchmark, and we envision they will be broadly applied to other natural language processing and biomedical tasks. On the biomedical side, summaries in the Textomics dataset could be used to impute experimentally measured gene expression data matrices and serve as additional features for classifying genomics data. On the NLP side, Textomics could also aid scientific paper analysis tasks such as paper recommendation (Bai et al., 2020), citation text generation (Luu et al., 2020), and citation prediction (Suzen et al., 2021).
References

Xiaomei Bai, Mengyang Wang, Ivan Lee, Zhuo Yang, Xiangjie Kong, and Feng Xia. 2020. Scientific paper recommendation: A survey.

Yael Baran, Akhiad Bercovich, Arnau Sebe-Pedros, Yaniv Lubling, Amir Giladi, Elad Chomsky, Zohar Meir, Michael Hoichman, Aviezer Lifshitz, and Amos Tanay. 2019. MetaCell: analysis of single-cell RNA-seq data using k-NN graph partitions. Genome Biology, 20(1):1-19.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space.

Liqun Chen, Guoyin Wang, Chenyang Tao, Dinghan Shen, Pengyu Cheng, Xinyuan Zhang, Wenlin Wang, Yizhe Zhang, and Lawrence Carin. 2019. Improving textual network embedding with global attention via optimal transport.

Wenhu Chen, Yu Su, Xifeng Yan, and William Yang Wang. 2020. KGPT: Knowledge-grounded pre-training for data-to-text generation.

Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S. Weld. 2020. SPECTER: Document-level representation learning using citation-informed transformers.

Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. 2020. Meshed-memory transformer for image captioning.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. 2019. Clotho: An audio captioning dataset.

Ondrej Dusek, Jekaterina Novikova, and Verena Rieser. 2019. Evaluating the state-of-the-art of end-to-end natural language generation: The E2E NLG challenge. CoRR, abs/1901.07931.

Ron Edgar, Michael Domrachev, and Alex E. Lash. 2002. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1).

Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179-188, Vancouver, Canada. Association for Computational Linguistics.

Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, and Thomas Brox. 2020. COOT: Cooperative hierarchical transformer for video-text representation learning.

Sara Goodwin, John D. McPherson, and W. Richard McCombie. 2016. Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics, 17(6):333-351.

Jiaqi Guan, Runzhe Li, Sheng Yu, and Xuegong Zhang. 2018. Generation of synthetic electronic medical record text.

Junxian He, Graham Neubig, and Taylor Berg-Kirkpatrick. 2021. Efficient nearest neighbor language models. In EMNLP.

Byungjin Hwang, Ji Hyun Lee, and Duhee Bang. 2018. Single-cell RNA sequencing technologies and bioinformatics pipelines. Experimental & Molecular Medicine, 50(8):1-14.

Minoru Kanehisa and Peer Bork. 2003. Bioinformatics in the post-sequence era. Nature Genetics, 33(3):305-310.

Lei Ke, Wenjie Pei, Ruiyu Li, Xiaoyong Shen, and Yu-Wing Tai. 2019. Reflective decoding network for image captioning.

Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2021. Nearest neighbor machine translation. ArXiv, abs/2010.00710.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2019. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172.

Michael Kramer, Janusz Dutkowski, Michael Yu, Vineet Bafna, and Trey Ideker. 2014. Inferring gene ontologies from pairwise similarity data. Bioinformatics, 30(12):i34-i42.

Remi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234-1240.

Scott H. Lee. 2018. Natural language generation for electronic health records.

Jacob H. Levine, Erin F. Simonds, Sean C. Bendall, Kara L. Davis, El-ad D. Amir, Michelle D. Tadmor, Oren Litvin, Harris G. Fienberg, Astraea Jager, Eli R. Zunder, Rachel Finck, Amanda L. Gedman, Ina Radtke, James R. Downing, Dana Pe'er, and Garry P. Nolan. 2015. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell, 162(1):184-197.
Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. 2020. HERO: Hierarchical encoder for video+language omni-representation pre-training.

Yehao Li, Yingwei Pan, Jingwen Chen, Ting Yao, and Tao Mei. 2021. X-modaler: A versatile and high-performance codebase for cross-modal analytics.

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollar. 2015. Microsoft COCO: Common objects in context.

Zequn Liu, Shukai Wang, Yiyang Gu, Ruiyi Zhang, Ming Zhang, and Sheng Wang. 2021. Graphine: A dataset for graph-aware terminology definition generation. arXiv preprint arXiv:2109.04018.

Kelvin Luu, Rik Koncel-Kedziorski, Kyle Lo, Isabel Cachola, and Noah A. Smith. 2020. Citation text generation. ArXiv, abs/2002.00317.

Ilaria Manco, Emmanouil Benetos, Elio Quinton, and Gyorgy Fazekas. 2021. MusCaps: Generating captions for music audio.

Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H. Lilian Tang, Xi Shao, Mark D. Plumbley, and Wenwu Wang. 2021. An encoder-decoder based audio captioning system with transfer and reinforcement learning.

Oren Melamud and Chaitanya Shivade. 2019. Towards automatic generation of shareable synthetic clinical notes using neural language models.

Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In EMNLP.

Yasuhide Miura, Yuhao Zhang, Emily Tsai, Curtis Langlotz, and Dan Jurafsky. 2021. Improving factual completeness and consistency of image-to-text radiology report generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Ivan Montero, Nikolaos Pappas, and Noah A. Smith. 2021. Sentence bottleneck autoencoders from transformer language models. In EMNLP.

Ratish Puduppully and Mirella Lapata. 2021. Data-to-text generation with macro planning.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67.

Clement Rebuffel, Marco Roberti, Laure Soulier, Geoffrey Scoutheeten, Rossella Cancelliere, and Patrick Gallinari. 2021. Controlling hallucinations at word level in data-to-text generation.

Clement Rebuffel, Laure Soulier, Geoffrey Scoutheeten, and Patrick Gallinari. 2019. A hierarchical model for data-to-text generation.

Leonardo F. R. Ribeiro, Yue Zhang, Claire Gardent, and Iryna Gurevych. 2020. Modeling global and local node contexts for text generation from knowledge graphs.

Linfeng Song, Ante Wang, Jinsong Su, Yue Zhang, Kun Xu, Yubin Ge, and Dong Yu. 2021. Structural information preserving for graph-to-text generation.

Tianbao Song, Jingbo Sun, Bo Chen, Weiming Peng, and Jihua Song. 2019. Latent space expanded variational autoencoder for sentence generation. IEEE Access, 7:144618-144627.

Neslihan Suzen, Alexander Gorban, Jeremy Levesley, and Evgeny Mirkes. 2021. Semantic analysis for automated evaluation of the potential impact of research articles.

Jean-Francois Ton, Walter A. Talbott, Shuangfei Zhai, and Joshua M. Susskind. 2021. Regularized training of nearest neighbor language models. ArXiv, abs/2109.08249.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.

Bin Wang and C.-C. Jay Kuo. 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models.

Sheng Wang, Jianzhu Ma, Michael Ku Yu, Fan Zheng, Edward W. Huang, Jiawei Han, Jian Peng, and Trey Ideker. 2018. Annotating gene sets by mining large literature collections with protein networks. In Pacific Symposium on Biocomputing 2018: Proceedings of the Pacific Symposium, pages 602-613. World Scientific.

Wenlin Wang, Chenyang Tao, Zhe Gan, Guoyin Wang, Liqun Chen, Xinyuan Zhang, Ruiyi Zhang, Qian Yang, Ricardo Henao, and Lawrence Carin. 2019. Improving textual network learning with variational homophilic embeddings.

Yixin Wang, Zihao Lin, Jiang Tian, Zhongchao Shi, Yang Zhang, Jianping Fan, and Zhiqiang He. 2021. Confidence-guided radiology report generation.
Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, and Juan Pablo Bello. 2021. Wav2CLIP: Learning robust audio representations from CLIP.

Jianbo Yuan, Haofu Liao, Rui Luo, and Jiebo Luo. 2019. Automatic radiology report generation based on multi-view image fusion and medical concept enrichment.

Xinyuan Zhang, Yi Yang, Siyang Yuan, Dinghan Shen, and Lawrence Carin. 2019. Syntax-infused variational autoencoder for text generation. ArXiv, abs/1906.02181.

Yanjian Zhang, Qin Chen, Yiteng Zhang, Zhongyu Wei, Yixu Gao, Jiajie Peng, Zengfeng Huang, Weijian Sun, and Xuan-Jing Huang. 2020. Automatic term name generation for gene ontology: Task and dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4705-4710.

Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, and Caiming Xiong. 2018. End-to-end dense video captioning with masked transformer.
A Appendices
We provide more details about our dataset and the related experimental results here. In Table S1, we summarize the statistics of the 9 Textomics platforms. There are 3 different species across the 9 platforms: Homo sapiens, Arabidopsis thaliana, and Mus musculus. #Sample (All) is the total number of samples per platform, #Sample (Vec2Text) is the number of samples remaining after BLEU-based filtering, and #Sample (PMC) is the number of samples with full scientific articles.

We also present the Spearman correlations between text-based similarity and vector-based similarity across the 9 platforms in Table S2. The Spearman correlation is higher than 0.2 on every platform, showing substantial agreement between text-based and vector-based similarity.

In Table S3, we present the automatic evaluation metric scores for the Vec2Text task, including BLEU-1, ROUGE-1, ROUGE-L, METEOR, and NIST, which indicate the consistent improvement of our method over the comparison approaches on different automatic metrics.
Platform    Species  #Sample (All)  #Sample (PMC)  #Sample (Vec2Text)  #Feature  M. R.
GPL96       H. S.    1,371          353            240                 100K      0.19
GPL198      A. T.    1,081          194            250                 100K      0.03
GPL570      H. S.    5,822          1,879          1,004               100K      0.12
GPL1261     M. M.    4,563          1,326          1,059               100K      0.09
GPL6244     H. S.    1,831          659            307                 100K      0.10
GPL6246     H. S.    2,366          850            388                 100K      0.08
GPL6887     M. M.    1,150          407            240                 100K      0.09
GPL10558    H. S.    2,580          1,261          519                 100K      0.11
GPL13534    H. S.    1,509          762            234                 100K      0.26

Table S1: Statistics of the Textomics data. Each row is a sequencing platform in Textomics. H. S. denotes Homo sapiens; A. T. denotes Arabidopsis thaliana; M. M. denotes Mus musculus; M. R. denotes missing rate. All, PMC, and Vec2Text denote the number of samples without filtering, with an associated PMC full-text article, and after automated filtering, respectively.
Textomics platform      GPL96  GPL198  GPL570  GPL1261  GPL6244  GPL6246  GPL6887  GPL10558  GPL13534
Spearman correlation    0.36   0.20    0.24    0.34     0.44     0.45     0.22     0.38      0.30

Table S2: Spearman correlations between text-based similarity and vector-based similarity across the 9 platforms.
Platform    BLEU-1  ROUGE-1  ROUGE-L  METEOR  NIST
GPL96       0.179   0.233    0.166    0.143   0.817
GPL198      0.198   0.257    0.192    0.168   0.889
GPL570      0.212   0.269    0.205    0.182   0.936
GPL1261     0.229   0.283    0.226    0.202   0.980
GPL6244     0.183   0.250    0.179    0.156   0.750
GPL6246     0.219   0.269    0.210    0.187   0.950
GPL6887     0.198   0.260    0.196    0.171   0.847
GPL10558    0.191   0.257    0.177    0.165   0.842
GPL13534    0.242   0.332    0.279    0.260   1.124

Table S3: Automatic evaluation results of kNN-Vec2Text on the Vec2Text task across the 9 platforms.