

Textomics: A Dataset for Genomics Data Summary Generation

Anonymous ACL submission

Abstract

Summarizing biomedical discoveries from genomics data in natural language is an essential step in biomedical research but is mostly done manually. Here, we introduce Textomics, a novel dataset of genomics data descriptions, which contains 22,273 pairs of a genomics data matrix and its summary. Each summary is written by the researchers who generated the data and is associated with a scientific paper. Based on this dataset, we study two novel tasks: generating a textual summary from a genomics data matrix and vice versa. Inspired by the successful applications of k nearest neighbors in modeling genomics data, we propose a kNN-Vec2Text model to address these tasks and observe substantial improvement on our dataset. We further illustrate how Textomics can be used to advance other applications, including evaluating scientific paper embeddings and generating masked templates for scientific paper understanding. Textomics serves as the first benchmark for generating textual summaries for genomics data, and we envision it will be broadly applied to other biomedical and natural language processing applications.

1 Introduction

Modern genomics research has become increasingly automated and can be roughly divided into three sequential steps: next-generation sequencing technology produces a massive amount of genomics data, which are in turn processed by bioinformatics tools to identify key variants and genes, and, ultimately, analyzed by biologists to summarize the discovery (Goodwin et al., 2016; Kanehisa and Bork, 2003). In contrast to the first two steps, which have been automated by new technologies and software, the last step of summarizing the discovery is still largely performed manually, substantially slowing down the progress of scientific discovery (Hwang et al., 2018).

A plausible solution is to automatically summarize the discovery from genomics data using neural text generation, which has been successfully applied to radiology report generation (Wang et al., 2021; Yuan et al., 2019) and clinical note generation (Melamud and Shivade, 2019; Lee, 2018; Miura et al., 2021).

In this paper, we study the novel task of generating sentences to summarize a genomics data matrix. Several existing approaches demonstrate encouraging results in generating short phrases to describe the functions of a set of genes (Wang et al., 2018; Zhang et al., 2020; Kramer et al., 2014). However, our task is fundamentally different from these: the input of our task is a matrix that contains tens of thousands of genes, which can be much noisier than a set of selected genes, and the output of our task is sentences rather than short phrases or controlled vocabularies.

To study this task, we curate a novel dataset, Textomics, by integrating data from PMC, PubMed, and the Gene Expression Omnibus (GEO) (Edgar et al., 2002) (Figure 1). GEO is the default database repository for researchers to upload their genomics data matrices, such as gene expression matrices and mutation matrices. Each genomics data matrix in GEO is a sample-by-feature matrix, where samples are often humans or mice that are sequenced together to study a specific biological problem and features are genes or variants. Each matrix is also associated with a few sentences written by the researchers to summarize the data matrix. After pre-processing, we obtain 22,273 matrix-summary pairs, spanning 9 sequencing technology platforms. Each matrix has on average 2,475 samples and 22,796 features. Each summary has on average 46 words.

We further propose a novel approach to automatically generate a summary from a genomics data matrix, which is highly noisy and high-dimensional.



Figure 1: Flow chart of Textomics. a. Genomics data matrices and summaries are collected from GEO. Scientific papers are collected from PMC and PubMed. Each data matrix is associated with a unique summary and a unique scientific paper in Textomics. b. Textomics is divided into 9 sequencing platforms, spanning various species. Data matrices in the same platform share the same features and can therefore be used to train a machine learning model. c. Textomics can be used as the benchmark for a variety of tasks, including Vec2Text, Text2Vec, measuring paper similarity, and scientific paper understanding. d. kNN-Vec2Text is developed to address the Vec2Text task by first constructing reference summaries from similar genomics data matrices and then unifying these summaries to generate a new summary.

k-nearest neighbor (kNN) approaches have achieved great success on genomics data by capturing the hidden modules within it (Levine et al., 2015; Baran et al., 2019). The key idea of our method is to find the k nearest summaries according to genomics data similarity and then exploit an attention mechanism to convert these k nearest summaries into a new summary. Our method obtains substantial improvement in comparison to baseline approaches. We further illustrate how we can generate a genomics data matrix from a given summary, offering the possibility of simulating genomics data from textual descriptions. We then introduce how Textomics can be used as a novel benchmark for measuring scientific paper similarity and evaluating scientific paper understanding. To the best of our knowledge, Textomics and kNN-Vec2Text together build the first large-scale benchmark for genomics data summary generation and can be broadly applied to a variety of natural language processing tasks.

2 Textomics Dataset

We collected genomics data matrices from the Gene Expression Omnibus (GEO) (Edgar et al., 2002). The features of each data matrix are genes or variants, and the samples are experimental subjects, such as experimental animals or patients. Each data matrix is associated with an expert-written summary describing the data matrix. We obtained in total 164,667 matrix-summary pairs, spanning 12,219 sequencing platforms. We truncated summaries longer than 64 words.

Data matrices belonging to the same sequencing platform share the same set of features and can thus be used together to train a model. To this end, we first selected, for each platform, the 20,000 features with the largest standard deviation and the lowest missing rate, and excluded samples with a substantially higher missing rate. We then selected the 9 platforms with the lowest rate of missing values and the largest number of matrix-summary pairs. We imputed the resulting data matrices using mean imputation and excluded outliers and non-informative summaries (e.g., "Please see our data below") through both manual inspection and an automated approach that excluded summaries substantially different from all other summaries based on pairwise BLEU scores. Finally, each of the 9 platforms contains 471 matrix-summary pairs on average, presenting a desirable number of training samples for developing data summary generation models. We summarize the statistics of these 9 platforms in Supplementary Table S1.
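A minimal sketch of this preprocessing pipeline is shown below. This is not the authors' released code: the data-loading step, the 0.3 BLEU cutoff, and the helper names `select_features` and `filter_summaries` are assumptions for illustration only.

```python
import numpy as np
from nltk.translate.bleu_score import sentence_bleu

def select_features(matrix, n_keep=20000):
    """Keep features with the largest standard deviation and lowest missing rate.

    `matrix` is a samples x features array with NaN marking missing values.
    """
    std = np.nanstd(matrix, axis=0)
    missing = np.isnan(matrix).mean(axis=0)
    # Primary key: large standard deviation; secondary key: low missing rate.
    order = np.lexsort((missing, -std))
    return matrix[:, order[:n_keep]]

def filter_summaries(summaries, min_max_bleu=0.3):
    """Drop summaries that are substantially different from all other summaries.

    The 0.3 threshold is an assumption; the paper only states that pairwise
    BLEU scores were used to flag outlier summaries.
    """
    tokenized = [s.lower().split() for s in summaries]
    kept = []
    for i, hyp in enumerate(tokenized):
        refs = [t for j, t in enumerate(tokenized) if j != i]
        best = max(sentence_bleu([r], hyp, weights=(1, 0, 0, 0)) for r in refs)
        if best >= min_max_bleu:
            kept.append(summaries[i])
    return kept
```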

Data matrices belonging to the same platform have distinct samples (e.g., patient samples collected from two hospitals). In order to make them comparable and to provide fixed-size features for machine learning models, we used a five-number summary to represent each data matrix. In particular, we calculated the smallest value, the first quartile, the median, the third quartile, and the largest value of each feature across the samples in a specific data matrix.


We then concatenated these values over all features, resulting in a 100k-dimensional feature vector for each data matrix. This vector is used as the input to the machine learning model, and the original summary written by the authors is used as the output.
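A minimal sketch of how such a gene feature vector could be built from a samples x features matrix; the function name `five_number_vector` and the use of NaN-aware quantiles are our assumptions, not the authors' exact implementation.

```python
import numpy as np

def five_number_vector(matrix):
    """Represent a samples x features data matrix by the five-number summary
    (min, Q1, median, Q3, max) of every feature, concatenated into one vector.

    With 20,000 features this yields a 100k-dimensional gene feature vector.
    """
    quantiles = np.nanquantile(matrix, [0.0, 0.25, 0.5, 0.75, 1.0], axis=0)
    # `quantiles` has shape (5, n_features); flatten feature by feature.
    return quantiles.T.reshape(-1)
```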

Each data matrix is associated with a scientific paper, which describes how the authors generated and used the data. The data matrix and the summary can therefore be used to help embed these papers. We additionally retrieved these papers from the PubMed and PMC databases according to the paper titles recorded in GEO and obtained the full text for the 7,691 freely accessible ones. We will introduce two applications that jointly use scientific papers and matrix-summary pairs in Section 6.

3 Task Description

We aim to accelerate genomics discovery by generating a textual summary given the five-number-summary-based vector of a genomics data matrix. We refer to this vector as the gene feature vector for simplicity. Specifically, consider the textual summary domain $\mathcal{D}$ and the gene feature vector domain $\mathcal{V}$. Let $D = \{D_{\mathcal{D}}, D_{\mathcal{V}}\} = \{(d_i, v_i)\}_{i=1}^{N} \sim P(\mathcal{D}, \mathcal{V})$ be a dataset containing $N$ summary-vector pairs sampled from the joint distribution of the two domains, where $d_i \triangleq \langle d_i^1, d_i^2, \ldots, d_i^{n_{d_i}} \rangle$ denotes a token sequence and $v_i \in \mathbb{R}^{l_v}$ denotes the gene feature vector. Here $d_i^j \in C$, where $C$ is the vocabulary.


Figure 2: Density plot showing the Spearman correlation (0.45) between text-based similarity (y-axis) and vector-based similarity (x-axis) on sequencing platform GPL6246. Each dot is a pair of data samples.

We now formally define two cross-domain generation tasks, Vec2Text and Text2Vec, based on our dataset. Given a gene feature vector $v_i$, Vec2Text aims to generate a summary $d_i$ that best describes $v_i$; given a textual summary $d_i$, Text2Vec aims to generate the gene feature vector $v_i$ that $d_i$ describes. Since we are studying a novel task on a novel dataset, we first examined its feasibility. To this end, we obtained a dense representation of each textual summary using the pre-trained SPECTER model (Cohan et al., 2020) and used these representations to calculate a summary-based similarity between each pair of summaries. We also calculated a vector-based similarity between the gene feature vectors using cosine similarity. We found that these two similarity measurements show substantial agreement (Figure 2, Supplementary Table S2). All 9 platforms achieved a Spearman correlation greater than 0.2, suggesting the possibility of generating a textual summary from the gene feature vector and vice versa.
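As a rough sketch of this feasibility check, assuming the summary embeddings (e.g., from SPECTER) and the gene feature vectors are already available as arrays; the function name `feasibility_correlation` is ours:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine_similarity_matrix(x):
    """Pairwise cosine similarities between the rows of x."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

def feasibility_correlation(summary_embeddings, gene_feature_vectors):
    """Spearman correlation between summary-based and vector-based pairwise
    similarities, computed over all distinct pairs of samples."""
    text_sim = cosine_similarity_matrix(summary_embeddings)
    vec_sim = cosine_similarity_matrix(gene_feature_vectors)
    iu = np.triu_indices(len(text_sim), k=1)  # upper triangle: each pair once
    rho, _ = spearmanr(text_sim[iu], vec_sim[iu])
    return rho
```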

4 Methods

4.1 Vec2Text

We first introduce a base model that encodes a gene feature vector into a semantic embedding space and then decodes it to generate text. The base model contains a word embedding function $\mathrm{Emb}(\cdot)$, a gene feature vector encoder $\mathrm{Enc}_v(\cdot)$, and a decoder $\mathrm{Dec}_v(\cdot)$. Given a gene feature vector $v_i$, the encoder first embeds the data into a semantic representation space, $s_i^{(0)} = \mathrm{Enc}_v(v_i)$, and the decoder starts from this representation to generate text. The generation process is autoregressive: it generates the $j$-th word $d_i^{(j)}$ and its embedding $s_i^{(j)}$ as

$$P(d_i^{(j)} \mid s_i^{(<j)}) = \mathrm{Dec}_v(s_i^{(<j)}), \quad j = 1, \ldots, n_{d_i}. \tag{1}$$

Then we sample the next word and obtain its embedding as

$$s_i^{(j)} = \mathrm{Emb}(d_i^{(j)}), \quad d_i^{(j)} \sim P(d_i^{(j)} \mid s_i^{(<j)}). \tag{2}$$

This model is trained using the following loss function:

$$\mathcal{L}_{\mathrm{base}} = -\frac{1}{|D_{\mathcal{V}}|} \sum_{i=1}^{|D_{\mathcal{V}}|} \sum_{j=1}^{n_{d_i}} \log P(d_i^{(j)} \mid s_i^{(<j)}). \tag{3}$$

4.1.1 kNN-Vec2Text Model

The base model attempts to learn an encoder that projects a gene feature vector to a semantic representation. However, the substantial noise and the high dimensionality of the gene feature vector pose great challenges to learning this projection effectively.


k-nearest neighbor models have been extensively used to overcome such issues in genomics data analysis (Levine et al., 2015; Baran et al., 2019). Therefore, one plausible solution is to explicitly leverage summaries from similar gene feature vectors to improve generation. Inspired by the encouraging performance of k-nearest neighbors (kNN) in seq2seq models (Khandelwal et al., 2019, 2021) and genomics data analysis (Levine et al., 2015; Baran et al., 2019), we propose to convert the Vec2Text problem into a Text2Text problem according to the k nearest neighbors of each vector.

For a given gene feature vector $g$, we use $e_i$ to denote its Euclidean distance to another gene feature vector $v_i$ in $D$. We then select the summaries of the $k$ samples with the smallest Euclidean distances as the reference summary list $t = [d_{j_1}, \ldots, d_{j_k}]$, where $j_m \in \{1, 2, \ldots, |D|\}$ denotes the index of the summaries ordered by Euclidean distance, i.e., $e_{j_1} \le e_{j_2} \le \ldots \le e_{j_{|D|}}$.
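A minimal sketch of this reference-summary retrieval step; the function name `nearest_reference_summaries` is ours, and we assume the query sample itself is excluded from the candidate set during training:

```python
import numpy as np

def nearest_reference_summaries(query_vec, train_vecs, train_summaries, k=4):
    """Return the summaries of the k training samples whose gene feature
    vectors have the smallest Euclidean distance to the query vector,
    together with those distances (used later to reweight the summaries)."""
    dists = np.linalg.norm(train_vecs - query_vec, axis=1)
    order = np.argsort(dists)[:k]
    return [train_summaries[i] for i in order], dists[order]
```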

In addition to alleviating the noise in genomics data through the reference summary list (Levine et al., 2015; Baran et al., 2019), our method explicitly converts the Vec2Text problem into a Text2Text problem and can thus seamlessly incorporate advanced pre-trained language models into our framework. The resulting problem is a k-source-to-one-target generation problem. One naive solution is to concatenate the k reference summaries. However, this concatenation makes the source text much longer than the target text, and how to order the summaries during concatenation remains unclear. Instead, we propose to transform this problem into k one-to-one generation problems and then use an attention-based strategy to fuse them. Concretely, let $n_j = \max\{n_{j_1}, \ldots, n_{j_k}\}$ be the maximum length among all the reference summaries. We first obtain the representation of each summary, $x_{j_m} = \mathrm{Emb}(d_{j_m}) = \langle x_{j_m}^{(1)}, \ldots, x_{j_m}^{(n_j)} \rangle$, for $m = 1, \ldots, k$.

We construct fixed-length reference summaries by padding every summary shorter than $n_j$ after its end. We then utilize a self-attention module (SA) (Vaswani et al., 2017) to obtain the aggregated embedding of each reference from its embeddings as well as the gene feature vector distances $e_i$. Let $Q_r, K_r, V_r$ be the query, key, and value matrices of an embedding sequence $r = \langle r^{(1)}, \ldots, r^{(l_r)} \rangle$; we have

$$\mathrm{SA}(r) = \mathrm{Attention}(Q_r, K_r, V_r). \tag{4}$$

We then calculate the attention scores as follows:

$$a_{j_m} = \mathrm{SA}(\langle x_{j_m}^{(1)}, \ldots, x_{j_m}^{(n_j)} \rangle), \tag{5}$$

$$sc_j = \mathrm{SA}(\langle e_{j_1} \cdot a_{j_1}, \ldots, e_{j_k} \cdot a_{j_k} \rangle), \tag{6}$$

where $sc_j = [sc_{j_1}, \ldots, sc_{j_k}] \in \mathbb{R}^k$. The final score is then calculated from the attention scores and a temperature $\tau$ as

$$w_{j_m} = \frac{\exp(\tau \cdot sc_{j_m})}{\sum_{l=1}^{k} \exp(\tau \cdot sc_{j_l})}. \tag{7}$$

Then, we aggregate the embedding sequences by taking weighted averages:

$$x_j^{(l)} = \sum_{m=1}^{k} w_{j_m} x_{j_m}^{(l)}, \quad l = 1, \ldots, n_j. \tag{8}$$
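A simplified sketch of the reweighting and aggregation in Eqs. (5)-(8), using a single self-attention layer per step; the module name `ReferenceFusion`, the use of `torch.nn.MultiheadAttention`, and the mean-pooling steps are our simplifying assumptions rather than the exact architecture.

```python
import torch
import torch.nn as nn

class ReferenceFusion(nn.Module):
    """Fuse k padded reference-summary embeddings into one embedding sequence."""

    def __init__(self, emb_dim, n_heads=8, tau=0.1):
        super().__init__()
        self.token_attn = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        self.ref_attn = nn.MultiheadAttention(emb_dim, n_heads, batch_first=True)
        self.tau = tau

    def forward(self, ref_embeddings, distances):
        # ref_embeddings: (k, n_j, emb_dim); distances: (k,) Euclidean distances e_{j_m}.
        attended, _ = self.token_attn(ref_embeddings, ref_embeddings, ref_embeddings)
        a = attended.mean(dim=1)                      # (k, emb_dim): one vector per reference
        scaled = distances.unsqueeze(-1) * a          # e_{j_m} * a_{j_m}
        sc, _ = self.ref_attn(scaled.unsqueeze(0), scaled.unsqueeze(0), scaled.unsqueeze(0))
        scores = sc.squeeze(0).mean(dim=-1)           # (k,): one scalar score per reference
        w = torch.softmax(self.tau * scores, dim=0)   # Eq. (7) with temperature tau
        # Eq. (8): weighted average over the k references at every position l.
        return (w.view(-1, 1, 1) * ref_embeddings).sum(dim=0)  # (n_j, emb_dim)
```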

Let $P_{<l,x}(d) = P_{\theta_{\mathrm{LM}}}(d^{(l)} \mid d^{(<l)}, x)$, $0 < l < n_d$, be the probability distribution over $d^{(l)}$ output by the language model $\theta_{\mathrm{LM}}$ conditioned on the sequence of embedding vectors $x$ and the first $l-1$ tokens. We feed the aggregated embedding sequence into the language model to reconstruct the summary $d$ using an autoregressive loss function:

$$\mathcal{L}_{\mathrm{kNN\text{-}Vec2Text}} = -\frac{1}{|D_{\mathcal{D}}|} \sum_{d \in D_{\mathcal{D}}} \sum_{l=1}^{n_d} \log P_{<l, x_j}(d). \tag{9}$$

4.2 Text2Vec

We model the reverse problem of generating the gene feature vector $v$ from a textual summary $d$ as a regression problem. Our model is composed of a semantic encoder $\mathrm{Enc}_d(\cdot)$ and a readout head $\mathrm{MLP}(\cdot)$. Specifically, the encoder embeds the textual summary into a dense representation $x = \mathrm{Enc}_d(d)$, and the readout head maps this representation to the gene feature vector $\hat{v} = \mathrm{MLP}(x)$. We train this model by minimizing the root mean squared error:

$$\mathcal{L}_v = \sqrt{\frac{1}{|D_{\mathcal{D}}|} \sum_{v_i \in D_{\mathcal{V}}} \frac{1}{l_d} \sum_{j=1}^{l_d} (\hat{v}_i^{(j)} - v_i^{(j)})^2}. \tag{10}$$
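A minimal sketch of such a Text2Vec model in PyTorch; the encoder is left abstract as any module producing a sentence embedding (e.g., a pre-trained language model), and the hidden size and the class name `Text2Vec` are assumptions.

```python
import torch
import torch.nn as nn

class Text2Vec(nn.Module):
    """Regress a gene feature vector from a summary embedding."""

    def __init__(self, encoder, emb_dim, out_dim, hidden_dim=1024):
        super().__init__()
        self.encoder = encoder          # maps token ids -> (batch, emb_dim) embeddings
        self.readout = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, token_ids):
        x = self.encoder(token_ids)     # dense summary representation
        return self.readout(x)          # predicted gene feature vector

def rmse_loss(pred, target):
    """Root mean squared error, as in Eq. (10)."""
    return torch.sqrt(torch.mean((pred - target) ** 2))
```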

5 Results

5.1 Vec2Text

To evaluate the performance of kNN-Vec2Text on the Vec2Text task, we compared it to the base model instantiated with a Transformer (Vaswani et al., 2017) and GPT-2 (Radford et al., 2019), as well as Sent-VAE (Bowman et al., 2016). For kNN-Vec2Text, we set $k = 4$ and $\tau = 0.1$, and used T5 (Raffel et al., 2020) as the language model.



Figure 3: Performance on Vec2Text and Text2Vec using Textomics as the benchmark. a. Bar plot comparing our method kNN-Vec2Text with existing approaches on the task of Vec2Text across 9 platforms in Textomics. b. Bar plot comparing the performance of different scientific paper embedding methods across 9 platforms in Textomics.

Table 1: A case study of the text generated by kNN-Vec2Text. Summaries of the four nearest neighbors in the input space are shown. The generated text is composed of short spans from four different neighbors (colored in red).

Neighbor 1: Analysis of B16 tumor microenvironments at gene expression level. The hypothesis tested in the present study was that Tregs orchastrated the immune reponse triggered in presence of tumors.

Neighbor 2: This study aims to look at gene expresion profiles between wildtype and Bapx1 knockout cells of the gut in a E12.5 mouse embryo.

Neighbor 3: The role of bone morphogenetic protein 2 in regulating transformation of the uterine stroma during embryo implantation in mice was investigated by the conditional ablation of Bmp2 in the uterus using the mouse.

Neighbor 4: Measurement of specific gene expression in clinical samples is a promising approach for monitoring the recipient immune status to the graft in organ transplantation.

Generated: Analysis of uterine microenvironment at gene expression level. The hypothesis tested in the present study was that Tregs orchestrated the immune reponse triggered in presence of embryo.

Truth: Analysis of uterine microenvironment at gene expression level. The hypothesis tested in the present study was that Tregs orchestrated the immune reponse triggered in presence of embryo.

For all 9 platforms, we report the average performance under 5-fold cross-validation. The BLEU-1 scores are summarized in Figure 3a. We found that kNN-Vec2Text outperformed the other methods by a large margin: kNN-Vec2Text obtained a 0.206 BLEU-1 score on average, while none of the other three methods achieved an average BLEU-1 score greater than 0.150. The prominent performance of our method demonstrates the effectiveness of using a k-nearest-neighbor approach to convert the Vec2Text problem into a Text2Text problem.
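A sketch of how such an evaluation could be run, computing BLEU-1 via NLTK under 5-fold cross-validation; the `train_and_generate` interface and the averaging scheme are our assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold
from nltk.translate.bleu_score import sentence_bleu

def evaluate_bleu1(vectors, summaries, train_and_generate, n_splits=5, seed=0):
    """Average BLEU-1 of generated summaries under 5-fold cross-validation.

    `train_and_generate(train_idx, test_idx)` is assumed to train a model on the
    training fold and return one generated summary per test sample.
    """
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(vectors):
        generated = train_and_generate(train_idx, test_idx)
        for idx, hyp in zip(test_idx, generated):
            ref = summaries[idx].lower().split()
            scores.append(sentence_bleu([ref], hyp.lower().split(), weights=(1, 0, 0, 0)))
    return float(np.mean(scores))
```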

To further understand the superior performance of the kNN-Vec2Text model, we present a case study in Table 1. In this case study, the generated summary is highly accurate compared with the ground-truth summary. By examining the summaries of the 4 nearest neighbors in the gene feature vector space, we found that the generated summary is composed of short spans from each individual neighbor, again indicating the advantage of using k nearest neighbors for this task.

Our method leverages an attention mechanism to unify these four neighbors, thus offering an accurate generation. We also observed consistent improvement of our method over the comparison approaches on other metrics and summarize the results in Supplementary Table S3.

5.2 Text2Vec

We next used the Text2Vec task to illustrate how our dataset can be used to compare the performance of different pre-trained language models. In particular, we compared a recently proposed scientific paper embedding method, SPECTER (Cohan et al., 2020), which has demonstrated prominent performance in a variety of scientific paper analysis tasks, with SciBERT (Beltagy et al., 2019), BioBERT (Lee et al., 2020), SentBERT (Wang and Kuo, 2020), and the vanilla BERT (Devlin et al., 2019). While the other language models directly take the token sequence as the input, the SPECTER model takes both the abstract and the title. To make a fair comparison, we concatenated the title and the summary as the input for the models other than SPECTER.


For all 9 platforms, we report the average performance under 5-fold cross-validation. We further implemented a simple averaging baseline that predicts the vector for a test summary as the average of the training-sample vectors. This baseline does not utilize any textual summary and thus helps assess the effect of using textual summary information in this task. We used RMSE to evaluate all methods and report the RMSE improvement of each method over the averaging baseline in Figure 3b. We found that all methods outperform the baseline, gaining at least a 15% improvement, indicating the importance of considering the textual summary in this task. SPECTER achieved the best overall performance among the five methods, suggesting the advantage of separately modeling the title and the abstract when embedding scientific papers.
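A small sketch of the reported metric, i.e., the relative RMSE improvement of a model over the averaging baseline; the function names are ours.

```python
import numpy as np

def rmse(pred, target):
    return np.sqrt(np.mean((pred - target) ** 2))

def improvement_over_averaging_baseline(model_pred, train_vectors, test_vectors):
    """Relative RMSE improvement of a Text2Vec model over the baseline that
    predicts the mean training vector for every test summary."""
    baseline_pred = np.tile(train_vectors.mean(axis=0), (len(test_vectors), 1))
    baseline_rmse = rmse(baseline_pred, test_vectors)
    model_rmse = rmse(model_pred, test_vectors)
    return (baseline_rmse - model_rmse) / baseline_rmse
```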

6 Applications

6.1 Evaluate paper embedding via Textomics

Embedding scientific papers is crucial for effectively identifying emerging research topics and new knowledge in the scientific literature. To this end, many machine learning models have been proposed to embed scientific papers into dense embeddings and apply these embeddings to a variety of downstream applications (Cohan et al., 2020; Lee et al., 2020; Wang and Kuo, 2020; Beltagy et al., 2019; Devlin et al., 2019). However, there are currently few gold standards for measuring the similarity between two papers. As a result, existing approaches use surrogate metrics such as citation relationships, keywords, and user activities to evaluate their paper embeddings (Cohan et al., 2020; Chen et al., 2019; Wang et al., 2019).

Textomics can be used to evaluate these paper embedding approaches by examining the consistency between the embedding-based paper similarity and the embedding-based summary similarity, since both the paper and the summary are written by the same authors. In particular, for a pair of summaries $d_i, d_j \in D_{\mathcal{D}}$, let $t_i, t_j$ be the text (e.g., abstracts) extracted from their corresponding scientific papers, and let $\mathrm{Enc}_d$ be the encoder of the paper embedding method we want to evaluate. We first obtain their embeddings as

$$s_{d_i}, s_{d_j} = \mathrm{Enc}_d(d_i), \mathrm{Enc}_d(d_j) \in \mathbb{R}^{l_s}, \tag{11}$$

$$s_{t_i}, s_{t_j} = \mathrm{Enc}_d(t_i), \mathrm{Enc}_d(t_j) \in \mathbb{R}^{l_s}. \tag{12}$$

We then compute the pairwise Euclidean distances between all pairs of summaries and all pairs of paper texts as

$$s_{d_{i,j}} = \sqrt{\sum_{k=1}^{l_s} (s_{d_i}^{(k)} - s_{d_j}^{(k)})^2} \in \mathbb{R}, \tag{13}$$

$$s_{t_{i,j}} = \sqrt{\sum_{k=1}^{l_s} (s_{t_i}^{(k)} - s_{t_j}^{(k)})^2} \in \mathbb{R}. \tag{14}$$
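A sketch of this evaluation protocol (Eqs. 11-14 plus the Spearman correlation step described in the following paragraph); the encoder is treated as an abstract callable and the function name is an assumption.

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist

def embedding_quality(encode, summaries, paper_texts):
    """Spearman correlation between pairwise summary distances and pairwise
    paper-text distances under the same encoder; higher means a better encoder."""
    s_d = np.stack([encode(d) for d in summaries])     # Eq. (11)
    s_t = np.stack([encode(t) for t in paper_texts])   # Eq. (12)
    dist_d = pdist(s_d, metric="euclidean")            # Eq. (13), all pairs
    dist_t = pdist(s_t, metric="euclidean")            # Eq. (14), all pairs
    rho, _ = spearmanr(dist_d, dist_t)
    return rho
```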

To evaluate the quality of the encoder $\mathrm{Enc}_d$, we calculate the Spearman correlation between the pairwise summary similarities and the pairwise text similarities: a larger Spearman correlation indicates that $\mathrm{Enc}_d$ is more accurate in embedding scientific papers. As a proof of concept, we obtained the full text of 7,691 papers in our dataset from the freely accessible PubMed Central. We segmented each paper into five sections: abstract, introduction, method, result, and conclusion. We first compared different paper embedding methods using the abstract of each paper. The five embedding methods we considered are introduced in Section 5.2. Since SPECTER takes both a title and a paragraph as input, we used the first sentence of the summary as a pseudo-title when encoding the summary. The results are summarized in Figure 4a. We found that SPECTER was substantially better than the other methods on 8 out of the 9 platforms. SPECTER is specifically developed to embed scientific papers by processing the title and the abstract separately, whereas the other pre-trained language models simply concatenate the title and the abstract. The superior performance of SPECTER suggests the importance of separately modeling the paper title and abstract when embedding scientific papers. SentBERT obtained the best performance among the four pre-trained language models, partially due to its prominent performance in sentence-level embedding. We further noticed that the relative performance of the different methods is largely consistent with previous work evaluated on other metrics (Cohan et al., 2020), demonstrating the high quality of Textomics.

After observing the superior performance of SPECTER, we next investigated which section of a paper can best be used to assess paper similarity. Although existing paper embedding approaches often leverage the abstract for embedding, other sections, such as the introduction and results, might also be informative, especially for papers describing a specific dataset or method. We thus applied SPECTER to embed five different sections of each scientific paper and used Textomics to evaluate which section best reflects paper similarity.



Figure 4: Performance on using Textomics as the benchmark to evaluate scientific paper embeddings. a. Bar plot showing the comparison of scientific paper embedding methods using Textomics as the benchmark. b. Bar plot showing the comparison of SPECTER embeddings of different paper sections using Textomics as the benchmark.

We observed a consistent improvement when using the abstract section compared with other paper sections (Figure 4b), which is consistent with the intuition that the abstract represents a good summary of the scientific paper, again indicating the reliability of using Textomics to evaluate paper embedding methods.

6.2 Scientific paper understanding

Creating masked sentences and then filling in the masks can examine whether a machine learning model has properly understood a scientific paper. However, one challenge in such research is how to generate masked sentences that are relevant to a given paper while also ensuring that the answer is enclosed in the paper. Our dataset can be used to automatically generate such masked sentences from the summary, which is highly relevant to the paper but does not overlap with it. In particular, we mask out keywords from the summary, use this masked summary as the question, and let a machine learning model find the answer in the non-overlapping scientific paper. Let $C_{\mathrm{bio}}$ be a dictionary containing the biological keywords we want to mask out from the summary, and let $(d_i, t_i)$ be a pair of textual summary and paragraph text extracted from its corresponding scientific paper. If the $j$-th word $w_i = d_i^{(j)}$ in the summary belongs to $C_{\mathrm{bio}}$, our proposed task is to predict which word in $C_{\mathrm{bio}}$ is the missing word in $d_{\mathrm{masked}}$ given $t_i$. The masked summary $d_{\mathrm{masked}}$ is the same as $d_i$ except that its $j$-th word is substituted with [PAD]. For simplicity, we mask at most one token in $d_i$. We therefore formulate our task as a multi-class classification problem.
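A minimal sketch of how such masked examples and classifier inputs could be constructed; the [PAD] placeholder follows the paper, while the embedding functions, exact-token keyword matching, and helper names are our assumptions.

```python
import torch
import torch.nn as nn

def make_masked_example(summary_tokens, keyword_vocab):
    """Mask the first summary token found in the keyword dictionary.

    Returns the masked token list and the keyword's index in the vocabulary,
    or None if the summary contains no keyword.
    """
    for j, tok in enumerate(summary_tokens):
        if tok in keyword_vocab:
            masked = summary_tokens[:j] + ["[PAD]"] + summary_tokens[j + 1:]
            return masked, keyword_vocab.index(tok)
    return None

class KeywordClassifier(nn.Module):
    """Fully connected network over the concatenated masked-summary and
    paragraph embeddings, predicting the masked keyword."""

    def __init__(self, emb_dim, n_keywords, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_keywords),
        )

    def forward(self, masked_summary_emb, paragraph_emb):
        return self.net(torch.cat([masked_summary_emb, paragraph_emb], dim=-1))
```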


Figure 5: Bar plot showing the accuracy of filling the masked sentences of ten biomedical categories across 9 platforms using Textomics as the benchmark.

Similar to Section 6.1, we used the paper abstract as the paragraph text $t_i$. To generate $C_{\mathrm{bio}}$, we leveraged a recently developed biological terminology dataset, Graphine (Liu et al., 2021), which provides biological phrases spanning 227 categories. We selected the 10 categories that produce the largest number of masked sentences in Textomics and manually filtered ambiguous words and stop words. On average, each category contains 317 keywords. We used a fully connected neural network to perform the multi-class classification task. The input feature is the concatenation of the masked summary embedding and the paragraph embedding; we used SPECTER to derive these embeddings, as it obtained the best performance in our previous analysis. The results are summarized in Figure 5. We observed high accuracy on all ten categories, far better than the 0.4% accuracy of random guessing, indicating the usefulness of our benchmark for scientific paper understanding.


Finally, we found that the performance on each category varied across platforms, suggesting the possibility of further improving performance by jointly learning from all platforms.

7 Related work

Our task is related to existing work that takes structured data as input and generates unstructured text. Different input data modalities and related datasets have been considered in the literature, including text triplets in RDF graphs (Gardent et al., 2017; Ribeiro et al., 2020; Song et al., 2021; Chen et al., 2020), data tables (Lebret et al., 2016; Rebuffel et al., 2021; Dusek et al., 2019; Rebuffel et al., 2019; Puduppully and Lapata, 2021; Chen et al., 2020), electronic medical records (Lee, 2018; Guan et al., 2018), radiology reports (Wang et al., 2021; Yuan et al., 2019; Miura et al., 2021), and other continuous data modalities without explicit textual structure such as images (Lin et al., 2015; Cornia et al., 2020; Ke et al., 2019; Radford et al., 2021), audio (Drossos et al., 2019; Manco et al., 2021; Wu et al., 2021; Mei et al., 2021), and video (Li et al., 2021; Ging et al., 2020; Zhou et al., 2018; Li et al., 2020). Different from these modalities, our dataset takes a high-dimensional genomics feature matrix as input, which does not exhibit explicit structure and is thus substantially different from the other modalities. Moreover, our dataset is the first that aims to convert genomics feature vectors into textual summaries. The substantial noise and high dimensionality of genomics data matrices further pose unique challenges for text generation.

Our kNN-Vec2Text model is inspired by the recent success of kNN-based models in machine translation (Khandelwal et al., 2021) and language modeling (Khandelwal et al., 2019; He et al., 2021; Ton et al., 2021). The main difference between our method and these approaches is that we leverage kNN in the genomics vector space to construct reference texts, whereas they use kNN in the text embedding space during the autoregressive generation process to help adjust the sampling distribution. There are other methods that can generate text from vectors, such as (Bowman et al., 2016; Song et al., 2019; Miao and Blunsom, 2016; Montero et al., 2021; Zhang et al., 2019). Their inputs are latent vectors that need to be inferred from the data and do not have specific meanings, which differs from our gene feature vectors.

8 Conclusion and future work

In this paper, we have proposed a novel dataset, Textomics, containing 22,273 pairs of a genomics data matrix and its corresponding textual summary. We then introduced the novel task of Vec2Text based on our dataset, which aims to generate the textual summary from the gene feature vector. To address this task, we proposed a novel method, kNN-Vec2Text, which constructs reference texts using nearest neighbours in the gene feature vector space and then generates a new summary from these references. We further introduced two applications that can be advanced using our dataset: one evaluates scientific paper similarity according to the similarity of the corresponding data summaries, and the other leverages our dataset to automatically generate masked sentences for scientific paper understanding.

Our method searches for nearest neighbours by calculating the Euclidean distance between the five-number summary vectors of genomics feature matrices. However, this might lose useful information contained in the original matrix. It is worth exploring end-to-end approaches that learn embeddings directly from the genomics feature matrix instead of representing it as a five-number summary vector. On the Text2Vec side, we are interested in extending our work to directly generate the whole genomics feature matrix instead of the five-number summary vector. It would also be interesting to jointly learn the Text2Vec and Vec2Text tasks; one potential solution is to further decode the generated vector to reconstruct the embedding of the summaries in Text2Vec, and to leverage the resulting decoder to predict the text embedding using a kNN method in the text embedding space.

To the best of our knowledge, Textomics and kNN-Vec2Text serve as the first large-scale genomics data description benchmark, and we envision they will be broadly applied to other natural language processing and biomedical tasks. On the biomedical side, summaries in the Textomics dataset could be used to impute experimentally measured gene expression data matrices and serve as additional features for classifying these genomics data. On the NLP side, Textomics could also help scientific paper analysis tasks, such as paper recommendation (Bai et al., 2020), citation text generation (Luu et al., 2020), and citation prediction (Suzen et al., 2021).


References

Xiaomei Bai, Mengyang Wang, Ivan Lee, Zhuo Yang, Xiangjie Kong, and Feng Xia. 2020. Scientific paper recommendation: A survey.

Yael Baran, Akhiad Bercovich, Arnau Sebe-Pedros, Yaniv Lubling, Amir Giladi, Elad Chomsky, Zohar Meir, Michael Hoichman, Aviezer Lifshitz, and Amos Tanay. 2019. Metacell: analysis of single-cell RNA-seq data using k-nn graph partitions. Genome Biology, 20(1):1-19.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space.

Liqun Chen, Guoyin Wang, Chenyang Tao, Dinghan Shen, Pengyu Cheng, Xinyuan Zhang, Wenlin Wang, Yizhe Zhang, and Lawrence Carin. 2019. Improving textual network embedding with global attention via optimal transport.

Wenhu Chen, Yu Su, Xifeng Yan, and William Yang Wang. 2020. KGPT: Knowledge-grounded pre-training for data-to-text generation.

Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S. Weld. 2020. SPECTER: Document-level representation learning using citation-informed transformers.

Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. 2020. Meshed-memory transformer for image captioning.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding.

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. 2019. Clotho: An audio captioning dataset.

Ondrej Dusek, Jekaterina Novikova, and Verena Rieser. 2019. Evaluating the state-of-the-art of end-to-end natural language generation: The E2E NLG challenge. CoRR, abs/1901.07931.

Ron Edgar, Michael Domrachev, and Alex E. Lash. 2002. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1).

Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179-188, Vancouver, Canada. Association for Computational Linguistics.

Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, and Thomas Brox. 2020. COOT: Cooperative hierarchical transformer for video-text representation learning.

Sara Goodwin, John D McPherson, and W Richard McCombie. 2016. Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics, 17(6):333-351.

Jiaqi Guan, Runzhe Li, Sheng Yu, and Xuegong Zhang. 2018. Generation of synthetic electronic medical record text.

Junxian He, Graham Neubig, and Taylor Berg-Kirkpatrick. 2021. Efficient nearest neighbor language models. In EMNLP.

Byungjin Hwang, Ji Hyun Lee, and Duhee Bang. 2018. Single-cell RNA sequencing technologies and bioinformatics pipelines. Experimental & Molecular Medicine, 50(8):1-14.

Minoru Kanehisa and Peer Bork. 2003. Bioinformatics in the post-sequence era. Nature Genetics, 33(3):305-310.

Lei Ke, Wenjie Pei, Ruiyu Li, Xiaoyong Shen, and Yu-Wing Tai. 2019. Reflective decoding network for image captioning.

Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2021. Nearest neighbor machine translation. ArXiv, abs/2010.00710.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2019. Generalization through memorization: Nearest neighbor language models. arXiv preprint arXiv:1911.00172.

Michael Kramer, Janusz Dutkowski, Michael Yu, Vineet Bafna, and Trey Ideker. 2014. Inferring gene ontologies from pairwise similarity data. Bioinformatics, 30(12):i34-i42.

Remi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234-1240.

Scott H. Lee. 2018. Natural language generation for electronic health records.

Jacob H. Levine, Erin F. Simonds, Sean C. Bendall, Kara L. Davis, El-ad D. Amir, Michelle D. Tadmor, Oren Litvin, Harris G. Fienberg, Astraea Jager, Eli R. Zunder, Rachel Finck, Amanda L. Gedman, Ina Radtke, James R. Downing, Dana Pe'er, and Garry P. Nolan. 2015. Data-driven phenotypic dissection of AML reveals progenitor-like cells that correlate with prognosis. Cell, 162(1):184-197.


Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, and Jingjing Liu. 2020. HERO: Hierarchical encoder for video+language omni-representation pre-training.

Yehao Li, Yingwei Pan, Jingwen Chen, Ting Yao, and Tao Mei. 2021. X-modaler: A versatile and high-performance codebase for cross-modal analytics.

Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollar. 2015. Microsoft COCO: Common objects in context.

Zequn Liu, Shukai Wang, Yiyang Gu, Ruiyi Zhang, Ming Zhang, and Sheng Wang. 2021. Graphine: A dataset for graph-aware terminology definition generation. arXiv preprint arXiv:2109.04018.

Kelvin Luu, Rik Koncel-Kedziorski, Kyle Lo, Isabel Cachola, and Noah A. Smith. 2020. Citation text generation. ArXiv, abs/2002.00317.

Ilaria Manco, Emmanouil Benetos, Elio Quinton, and Gyorgy Fazekas. 2021. MusCaps: Generating captions for music audio.

Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H Lilian Tang, Xi Shao, Mark D. Plumbley, and Wenwu Wang. 2021. An encoder-decoder based audio captioning system with transfer and reinforcement learning.

Oren Melamud and Chaitanya Shivade. 2019. Towards automatic generation of shareable synthetic clinical notes using neural language models.

Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In EMNLP.

Yasuhide Miura, Yuhao Zhang, Emily Tsai, Curtis Langlotz, and Dan Jurafsky. 2021. Improving factual completeness and consistency of image-to-text radiology report generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).

Ivan Montero, Nikolaos Pappas, and Noah A. Smith. 2021. Sentence bottleneck autoencoders from transformer language models. In EMNLP.

Ratish Puduppully and Mirella Lapata. 2021. Data-to-text generation with macro planning.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67.

Clement Rebuffel, Marco Roberti, Laure Soulier, Geoffrey Scoutheeten, Rossella Cancelliere, and Patrick Gallinari. 2021. Controlling hallucinations at word level in data-to-text generation.

Clement Rebuffel, Laure Soulier, Geoffrey Scoutheeten, and Patrick Gallinari. 2019. A hierarchical model for data-to-text generation.

Leonardo F. R. Ribeiro, Yue Zhang, Claire Gardent, and Iryna Gurevych. 2020. Modeling global and local node contexts for text generation from knowledge graphs.

Linfeng Song, Ante Wang, Jinsong Su, Yue Zhang, Kun Xu, Yubin Ge, and Dong Yu. 2021. Structural information preserving for graph-to-text generation.

Tianbao Song, Jingbo Sun, Bo Chen, Weiming Peng, and Jihua Song. 2019. Latent space expanded variational autoencoder for sentence generation. IEEE Access, 7:144618-144627.

Neslihan Suzen, Alexander Gorban, Jeremy Levesley, and Evgeny Mirkes. 2021. Semantic analysis for automated evaluation of the potential impact of research articles.

Jean-Francois Ton, Walter A. Talbott, Shuangfei Zhai, and Joshua M. Susskind. 2021. Regularized training of nearest neighbor language models. ArXiv, abs/2109.08249.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.

Bin Wang and C. C. Jay Kuo. 2020. SBERT-WK: A sentence embedding method by dissecting BERT-based word models.

Sheng Wang, Jianzhu Ma, Michael Ku Yu, Fan Zheng, Edward W Huang, Jiawei Han, Jian Peng, and Trey Ideker. 2018. Annotating gene sets by mining large literature collections with protein networks. In Pacific Symposium on Biocomputing 2018: Proceedings of the Pacific Symposium, pages 602-613. World Scientific.

Wenlin Wang, Chenyang Tao, Zhe Gan, Guoyin Wang, Liqun Chen, Xinyuan Zhang, Ruiyi Zhang, Qian Yang, Ricardo Henao, and Lawrence Carin. 2019. Improving textual network learning with variational homophilic embeddings.

Yixin Wang, Zihao Lin, Jiang Tian, Zhongchao Shi, Yang Zhang, Jianping Fan, and Zhiqiang He. 2021. Confidence-guided radiology report generation.


Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, and Juan Pablo Bello. 2021. Wav2CLIP: Learning robust audio representations from CLIP.

Jianbo Yuan, Haofu Liao, Rui Luo, and Jiebo Luo. 2019. Automatic radiology report generation based on multi-view image fusion and medical concept enrichment.

Xinyuan Zhang, Yi Yang, Siyang Yuan, Dinghan Shen, and Lawrence Carin. 2019. Syntax-infused variational autoencoder for text generation. ArXiv, abs/1906.02181.

Yanjian Zhang, Qin Chen, Yiteng Zhang, Zhongyu Wei, Yixu Gao, Jiajie Peng, Zengfeng Huang, Weijian Sun, and Xuan-Jing Huang. 2020. Automatic term name generation for gene ontology: Task and dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4705-4710.

Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, and Caiming Xiong. 2018. End-to-end dense video captioning with masked transformer.

A Appendices

We provide more details about our dataset and related experimental results here. In Table S1, we summarize the statistics of the 9 Textomics platforms. There are 3 species across the 9 platforms: Homo sapiens, Arabidopsis thaliana, and Mus musculus. #Sample (All) denotes the total number of samples for each platform, #Sample (Vec2Text) the number of samples remaining after BLEU filtering, and #Sample (PMC) the number of samples with full scientific articles. We also report the Spearman correlations between text-based similarity and vector-based similarity across the 9 platforms in Table S2. The Spearman correlation is higher than 0.2 on every platform, showing substantial agreement between text-based and vector-based similarity. In Table S3, we report the automatic evaluation metric scores of kNN-Vec2Text on the Vec2Text task, including BLEU-1, ROUGE-1, ROUGE-L, METEOR, and NIST, which indicate consistent improvement of our method over the comparison approaches on different automatic metrics.


Platform    Species  #Sample (All)  #Sample (PMC)  #Sample (Vec2Text)  #Feature  M. R.
GPL96       H. S.    1,371          353            240                 100K      0.19
GPL198      A. T.    1,081          194            250                 100K      0.03
GPL570      H. S.    5,822          1,879          1,004               100K      0.12
GPL1261     M. M.    4,563          1,326          1,059               100K      0.09
GPL6244     H. S.    1,831          659            307                 100K      0.10
GPL6246     H. S.    2,366          850            388                 100K      0.08
GPL6887     M. M.    1,150          407            240                 100K      0.09
GPL10558    H. S.    2,580          1,261          519                 100K      0.11
GPL13534    H. S.    1,509          762            234                 100K      0.26

Table S1: Statistics of the Textomics data. Each row is a sequencing platform in Textomics. H. S. denotes Homo sapiens, A. T. Arabidopsis thaliana, and M. M. Mus musculus. M. R. denotes missing rate. All, PMC, and Vec2Text denote the number of samples without filtering, with an associated PMC full-text article, and after automated filtering, respectively.

Textomics platform     GPL96  GPL198  GPL570  GPL1261  GPL6244  GPL6246  GPL6887  GPL10558  GPL13534
Spearman correlation   0.36   0.20    0.24    0.34     0.44     0.45     0.22     0.38      0.30

Table S2: Spearman correlation between text-based similarity and vector-based similarity for each platform.

Platform    BLEU-1  ROUGE-1  ROUGE-L  METEOR  NIST
GPL96       0.179   0.233    0.166    0.143   0.817
GPL198      0.198   0.257    0.192    0.168   0.889
GPL570      0.212   0.269    0.205    0.182   0.936
GPL1261     0.229   0.283    0.226    0.202   0.980
GPL6244     0.183   0.250    0.179    0.156   0.750
GPL6246     0.219   0.269    0.210    0.187   0.950
GPL6887     0.198   0.260    0.196    0.171   0.847
GPL10558    0.191   0.257    0.177    0.165   0.842
GPL13534    0.242   0.332    0.279    0.260   1.124

Table S3: Automatic evaluation metric scores of kNN-Vec2Text on the Vec2Text task across the 9 platforms.

