UNIVERSITY OF SOUTH CAROLINA
Department of Computer Science and Engineering

Causal Discovery from Medical Textual Data

Subramani Mani and Gregory F. Cooper. Proceedings of the AMIA Annual Fall Symposium, 2000, pp. 542--546. Hanley and Belfus Publishers, Philadelphia, PA.

Available at: http://omega.cbmi.upmc.edu/~mani/pub/amia_fs2000.pdf

Presentation for the Bayesian Networks Reading Club

Marco Valtorta [email protected] January 28, 2005
Transcript
Page 2: Marco Valtorta mgv@cse.sc January 28, 2005

Learning from Textual Data

• Text is ubiquitous

• Causal knowledge aids in planning and decision making
  – Because it supports manipulation

• Causal relations may represent prior and tacit knowledge

• Learning from textual data is a new and difficult area of research

• “[T]he present paper reports the first investigation of causal knowledge discovery from text.”

Page 3: Marco Valtorta mgv@cse.sc January 28, 2005

Causal BNs

• A causal BN is a BN in which each arc is interpreted as a direct causal influence between parent and child node

• Causal (why?) Markov condition: a variable is independent of its non-descendants given its parents

• Causal (why?) faithfulness condition: variables are independent only if their independence is implied by the causal Markov condition

• Statistical testing assumption: independence tests on a finite dataset are correct with respect to the underlying causal process that generated the dataset
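
The Markov condition can be checked by hand on a small network. The sketch below assumes a hypothetical three-variable chain W -> X -> Y over binary variables with made-up probability tables (none of these numbers come from the paper). It verifies that Y is independent of its non-descendant W given its parent X, while W and Y remain marginally dependent, as faithfulness would lead one to expect for a connected chain.

```python
from itertools import product

# Hypothetical conditional probability tables for a chain W -> X -> Y.
# All numbers are invented for illustration only.
P_W = {0: 0.7, 1: 0.3}
P_X_GIVEN_W = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # P_X_GIVEN_W[w][x]
P_Y_GIVEN_X = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}  # P_Y_GIVEN_X[x][y]


def joint(w, x, y):
    """Joint probability from the BN factorization P(w) P(x|w) P(y|x)."""
    return P_W[w] * P_X_GIVEN_W[w][x] * P_Y_GIVEN_X[x][y]


def prob(**fixed):
    """Marginal probability of a partial assignment, e.g. prob(w=1, y=0)."""
    total = 0.0
    for w, x, y in product((0, 1), repeat=3):
        values = {"w": w, "x": x, "y": y}
        if all(values[name] == value for name, value in fixed.items()):
            total += joint(w, x, y)
    return total


def y_independent_of_w_given_x(tol=1e-12):
    """Markov condition for Y: P(y | x, w) equals P(y | x) everywhere."""
    for w, x, y in product((0, 1), repeat=3):
        lhs = prob(w=w, x=x, y=y) / prob(w=w, x=x)
        rhs = prob(x=x, y=y) / prob(x=x)
        if abs(lhs - rhs) > tol:
            return False
    return True


def y_independent_of_w(tol=1e-12):
    """Marginal independence of W and Y: P(w, y) equals P(w) P(y)."""
    return all(
        abs(prob(w=w, y=y) - prob(w=w) * prob(y=y)) <= tol
        for w, y in product((0, 1), repeat=2)
    )
```

With these tables, `y_independent_of_w_given_x()` holds and `y_independent_of_w()` does not: conditioning on the parent X screens off W from Y.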

Page 4: Marco Valtorta mgv@cse.sc January 28, 2005

LCD Algorithm

• Algorithm for finding causal links between pairs of variables

• Assumes the causal Markov condition, the causal faithfulness condition, the statistical testing assumption, and one additional assumption (Assumption 4)

Page 5: Marco Valtorta mgv@cse.sc January 28, 2005

Assumption 4

• Given measured variables X, Y, and Z, if Y causes Z, and Y and Z are not confounded (i.e., they do not have a common unmeasured cause), then one of the causal networks below must hold:

• In case (1), X and Y are independent; in case (2), they are dependent due to X causing Y; and in cases (3) and (4), they are confounded

Page 6: Marco Valtorta mgv@cse.sc January 28, 2005

Local Causal Discovery

Consider three measured variables W, X, and Y. We will model the ways in which each pair can be causally related as in the tables above. The H variables are unmeasured (latent, hidden) variables. There are 96 ways in which W, X, and Y can interact. This is not a complete list, but Cooper argues that very little is lost.

Based on: Cooper, Gregory F. “A Simple Constraint-Based Algorithm for Efficiently Mining Observational Databases for Causal Relationships.” Data Mining and Knowledge Discovery 1, 203-224 (1997).

Page 7: Marco Valtorta mgv@cse.sc January 28, 2005

Three Tests

• D(W,X)
  – or: Dependent(X,Y)

• D(X,Y)
  – or: Dependent(Y,Z)

• I(W,Y|X)
  – or: Independent(X, Z given Y)
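
As a rough illustration of how the three tests might be run on binary data, the sketch below uses Pearson chi-square statistics at the 0.05 level. This choice of test is an assumption made here, not necessarily the test used in the papers; treating "failure to reject" as independence is likewise a simplification. The data set is a hypothetical table of counts built to follow a chain W -> X -> Y, and all function names are illustrative.

```python
from collections import Counter

CHI2_CRIT_DF1 = 3.841  # chi-square critical value, 0.05 level, 1 degree of freedom
CHI2_CRIT_DF2 = 5.991  # same level, 2 degrees of freedom (two binary strata)


def chi2_2x2(pairs):
    """Pearson chi-square statistic for a list of (a, b) pairs of 0/1 values."""
    n = len(pairs)
    counts = Counter(pairs)
    stat = 0.0
    for a in (0, 1):
        for b in (0, 1):
            row_total = counts[(a, 0)] + counts[(a, 1)]
            col_total = counts[(0, b)] + counts[(1, b)]
            expected = row_total * col_total / n if n else 0.0
            if expected > 0:
                stat += (counts[(a, b)] - expected) ** 2 / expected
    return stat


def dependent(data, i, j):
    """D(i, j): reject independence of columns i and j."""
    return chi2_2x2([(row[i], row[j]) for row in data]) > CHI2_CRIT_DF1


def independent_given(data, i, j, k):
    """I(i, j | k): sum the statistic over both strata of column k."""
    stat = sum(
        chi2_2x2([(row[i], row[j]) for row in data if row[k] == value])
        for value in (0, 1)
    )
    return stat <= CHI2_CRIT_DF2


def toy_chain_data():
    """Hypothetical counts for a chain W -> X -> Y; each row is (w, x, y)."""
    counts = {
        (0, 0, 0): 320, (0, 0, 1): 80, (0, 1, 0): 20, (0, 1, 1): 80,
        (1, 0, 0): 80, (1, 0, 1): 20, (1, 1, 0): 80, (1, 1, 1): 320,
    }
    return [row for row, count in counts.items() for _ in range(count)]


W, X, Y = 0, 1, 2  # column indices, in Cooper's notation
data = toy_chain_data()
all_three_hold = (
    dependent(data, W, X)
    and dependent(data, X, Y)
    and independent_given(data, W, Y, X)
)
```

On this toy table all three tests hold, which is the pattern that leads LCD to report that X causes Y (in Cooper's notation).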

Page 8: Marco Valtorta mgv@cse.sc January 28, 2005

D-Separation Conditions for the 96 Causal Graphs

Graphs 18, 19, and 20 are the only ones for which all three tests D(W,X), D(X,Y), and I(W,Y|X) hold. In each of the three graphs, X causes Y.

Page 9: Marco Valtorta mgv@cse.sc January 28, 2005

The Three Graphs for Which All Tests Are Positive

Page 10: Marco Valtorta mgv@cse.sc January 28, 2005

Example: No Causal Link

There is no causal link “Y causes Z”: the test Independent(X, Z given Y) fails.

Page 11: Marco Valtorta mgv@cse.sc January 28, 2005

LCD Pseudocode
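
A minimal sketch of the search loop implied by the three tests, written in the X, Y, Z notation of the text study. It is an illustration under stated assumptions, not Cooper's published pseudocode: the function `lcd`, its parameters, and the toy oracle functions are hypothetical names, and the independence tests are passed in as callables so that any statistical test can be plugged in.

```python
from itertools import permutations


def lcd(variables, acausal, dependent, independent_given):
    """Return the set of (cause, effect) pairs that pass the three tests.

    variables: names of all measured variables
    acausal: names of variables assumed to have no measured cause (the X role)
    dependent(a, b): True if a and b test as dependent
    independent_given(a, b, c): True if a and b test as independent given c
    """
    found = set()
    for x in acausal:
        others = [v for v in variables if v != x]
        # Try every ordered pair (y, z) as candidate cause and effect.
        for y, z in permutations(others, 2):
            if (
                dependent(x, y)
                and dependent(y, z)
                and independent_given(x, z, y)
            ):
                found.add((y, z))
    return found


# Hypothetical oracle for a true chain X -> Y -> Z: every pair is
# dependent, and the only conditional independence is X and Z given Y.
def chain_dependent(a, b):
    return a != b


def chain_independent_given(a, b, c):
    return {a, b} == {"X", "Z"} and c == "Y"


result = lcd(["X", "Y", "Z"], ["X"], chain_dependent, chain_independent_given)
```

For the chain oracle the only pair returned is `("Y", "Z")`; the reversed pair is rejected because X and Y are not independent given Z. The loop visits each acausal variable against all ordered pairs of the remaining variables, which is where the factor of r times n squared in the time bound comes from.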

Page 12: Marco Valtorta mgv@cse.sc January 28, 2005

Extra Test

• D(W,Y)
  – or: Dependent(X,Z)

• Why? Redundancy, if I understand Cooper correctly.

Page 13: Marco Valtorta mgv@cse.sc January 28, 2005

Limitations of LCD

• Many causal networks are missed, e.g.:

• LCD only returns separate pairwise causal relationships, which may need to be assembled.

Page 14: Marco Valtorta mgv@cse.sc January 28, 2005

Time Complexity

• Not too bad, because only three variables at a time are considered

• O(mn²r), where m is the number of cases in the database, n is the number of variables, and r is the number of variables such as X (W in Cooper’s paper), i.e., variables that have no cause (“acausal”)
  – O(mn) if “few” acausal variables and potential effects (like Z, or Y in Cooper’s paper)

• Space complexity: O(mn), which is the size of the database

Page 15: Marco Valtorta mgv@cse.sc January 28, 2005

Text Dataset

• 2060 ICU discharge summaries (documents)

• 1808 unique words appeared in the documents

• Age, gender, and race appeared in 1611 documents and were considered causeless (“acausal”)

• Each of the 1808 words was coded as present (1) or absent (0): in total 1811 variables

• m=1611, n=1811, r=3

• Only 18 variables of type Z (possible effects) were considered: nausea, cirrhosis, dyspnea,…
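
To get a feel for these sizes, the following back-of-the-envelope arithmetic plugs the dataset figures into the time bound from the Time Complexity slide. The triple count assumes that Y may be any variable other than the chosen X and Z; that counting rule is an assumption made here for illustration, not a figure from the paper.

```python
# Dataset sizes as stated on the slide.
m = 1611          # cases (documents in which age, gender, and race appear)
n = 1811          # variables (1808 words plus age, gender, race)
r = 3             # acausal variables
num_effects = 18  # variables considered in the Z (possible effect) role

# General bound: proportional to m * n * n * r.
worst_case_bound = m * n * n * r

# Assumed counting rule: one triple per (X, Z) pair and per remaining Y.
candidate_triples = r * num_effects * (n - 2)

# Each triple needs a pass over the m cases to fill its contingency tables.
restricted_work = candidate_triples * m

print(worst_case_bound)   # 15850891593
print(candidate_triples)  # 97686
print(restricted_work)    # 157372146
```

Restricting Z to 18 candidate effects cuts the work by about two orders of magnitude relative to the general bound.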

Page 16: Marco Valtorta mgv@cse.sc January 28, 2005

Results

• One good causal relation was recovered

• One bad causal relation was also obtained

• A study from infant birth and death records led to more causal relations.

Ref.: Subramani Mani and Gregory F. Cooper. “A Study in Causal Discovery from Population-Based Infant Birth and Death Records.” Proceedings of the AMIA Annual Fall Symposium, 1999, pp. 315--319. Hanley and Belfus Publishers, Philadelphia, PA.

Page 17: Marco Valtorta mgv@cse.sc January 28, 2005

Suggested Improvements

• Multi-word phrases

• More records

• Multivariate causes or effects (?)

• Encoding variable-value pairs (e.g.: serum sodium = high) (?)

• The number of occurrences of phrases in a document

• The location of phrases in a document

Page 18: Marco Valtorta mgv@cse.sc January 28, 2005

Limitations of the Text Study

• Words, not phrases

• Present or absent only

• Context of a word is not considered
  – “hypertensive” and “not hypertensive” are treated the same way!

• Synonyms are treated as different words

• More generally: no linguistic analysis, no domain (semantic, ontological?) information is used
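
The first limitations are easy to reproduce. The sketch below is a hypothetical present/absent word encoder, not the preprocessing used in the study, applied to three invented sentences. It shows that a negated mention gets the same value as a positive one, and that a paraphrase shares nothing with the word it paraphrases.

```python
import re


def encode(documents):
    """Code each distinct word as present (1) or absent (0) per document."""
    tokenized = [set(re.findall(r"[a-z]+", doc.lower())) for doc in documents]
    vocabulary = sorted(set().union(*tokenized))
    rows = [
        [1 if word in words else 0 for word in vocabulary]
        for words in tokenized
    ]
    return vocabulary, rows


# Invented example sentences, not taken from the ICU discharge summaries.
documents = [
    "Patient is hypertensive.",
    "Patient is not hypertensive.",
    "Patient has high blood pressure.",
]
vocabulary, rows = encode(documents)
column = vocabulary.index("hypertensive")
hypertensive_values = [row[column] for row in rows]
print(hypertensive_values)  # [1, 1, 0]
```

The first two documents are indistinguishable on the "hypertensive" variable even though they say opposite things, and the third, which describes the same condition in other words, is coded as absent.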

