Data Stream Warehousing
Lukasz Golab lgolab@uwaterloo.ca University of Waterloo
Theodore Johnson johnsont@research.att.com
AT&T Labs - Research
Big Data
• Every 2 days we create as much information as we did up to 2003 (Eric Schmidt)
• Becoming easier to produce/collect – Sensors, Web, cheap bandwidth
• Becoming easier/cheaper to store – Cheap hard disks, commodity hardware
Traditional Big Data Workflow • Wait for data to arrive • Prepare and load data
– Into HDFS, key-value store, … – or into a database, then index
• Compute result • Start over
But • Many interesting data sets are “streaming”
– Monitoring (IP networks, infrastructure, smart transportation systems and power grids, RFID, system logs, manufacturing)
– Transactions (stock tickers, credit card purchases) – User behaviour logs (Web, social media)
Stream Data Workflow
• For each item or batch of items – Do some processing – Compute/update results
• Now feasible due to cheap RAM, multi-cores, etc.
“Fast Data” Systems • Data Stream Management Systems (DSMS)
– Borealis, StreamBase, Gigascope – Simple queries over fast append-only data – Results streamed out, usually not stored
• Key-value stores have fast transactional response, but analytics are difficult – Put/get interface makes correlation difficult – Analytics are inefficient on distributed stores
In This Tutorial • Big Data Management
– Focus on scalability and deep analytics, but high latency
• Fast Data Management – Low latency, but limited capability and no persistent storage
• Can we do both? – Data Stream Warehousing
Five “V”s of Big Data • Volume • Velocity • Variety
– Data integration • Verification
– Data cleaning • Value
– Data mining
Outline • Why? • What? • Detailed example • How?
– Common elements – System architectures – Performance optimizations – Data stream quality
• Open Problems
Why? • Could have 2 separate systems, but
– Not clear where to divide the systems – Overhead of moving data from one system to the other
– Harder to develop applications • Different SQL dialects, etc.
– Historical data provides context for real-time data – Even traditional analytics/reporting is becoming more real-time
• Reduce time from ingest to insight
What? • Load data from a multitude of streaming sources – Wide variation in data latencies
• Provide transparent access to both real-time and historical data
• Gracefully handle late-arriving data • Schedule queries and updates to materialized views in spite of highly variable workloads – Load shedding by dropping data is not an option
Multitude of streaming sources • Data becomes most useful when you can correlate results from many sources – Hundreds to thousands of distinct data feeds
• Network monitoring – Correlate Twitter feeds, active monitoring streams, and link utilizations to identify trouble spots
• Smart Grid – Correlate smart meter readings, line temperature measurements, and phasor measurement units to proactively react to overloads and avoid blackouts
[Chart: number of windows vs. time (seconds)]
Late-arriving data • Late-arriving data is a common problem for streaming systems.
• DSMS: data arrives minutes late
• Stream Warehouse: data can arrive days late
• Load all data and propagate results in spite of lateness.
• Alerting, troubleshooting, and real-time data mining all depend on access to real-time and historical data
• Hard to draw a boundary between new and old
Transparent Access
Scheduling • Ensure that the most time-critical applications/views get priority service.
• Ensure that no application is starved • In spite of temporary overload
Network Monitoring • Darkstar project at AT&T Labs • Motivating application for the Data Depot stream warehouse system
• Data collected: – Passive and active probe measurements, route monitoring, system logs, configuration data, customer service tickets and notes
• For: – Networking research, data mining, alerting, troubleshooting
Darkstar: Mining Vast Amounts of Data
[Diagram: data sources feeding Darkstar across network layers (IP Backhaul, Enterprise IP, VPNs, Ethernet Access, IPTV, Layer one, Mobility): route monitors (OSPFmon, BGPmon), device service monitoring (CIQ, MTANet, STREAM), active service and connectivity monitoring, Syslog, Config, SNMP polling (router, link), Netflow, Deep Packet Inspection (DPI), alarms, tickets, authentication/logging (tacacs), customer feedback (IVR, tickets, MTS)]
ARGUS: Detecting Service Issues… • Goal: detect and isolate actionable anomaly events using comprehensive end-to-end performance measurements (e.g. GS tool) • Sophisticated anomaly detection and heuristics • Spatial localization • Accurately accounts for service performance that varies considerably by time-of-day and location
• Impact: • Reduced detection time from days to approx. 15 mins for detecting data service issues
• Operational nation-wide monitoring of data service performance for 3G and LTE (TCP retransmission, RTT, throughput from GS Tool)
Approach: Mobility Localization Hierarchy
[Diagram: collect end-to-end performance data and localize along the mobility hierarchy: Market, Sub-Market, SGSN, RNC, SITE, with GGSNs at the core]
Case Example: Silent CE Overload Condition • ARGUS detected event: 2 Columbia 3G Ericsson SGSNs impacting RNCs in West Virginia, Norfolk, and Richmond • No other indication of issue • Topology highlighted CE used by only the impacted SGSNs
• RCA: “6148 48 port 1gig card is limited to a shared 1 gig bus for each set of 8 gig ports”
ARGUS alarm: clmamdorpn2 (TCP retransmissions) CE utilization flattening
ARGUS As a General Capability… Spike in call drop rate on MSC hrndvacxca1 RTT anomalies (SGSN level)
[Timeline: outage start 5:30 GMT; first anomaly 5:40 GMT; CTS ticket created 08:21 GMT; social media (Twitter) reports of NY and LA outages; node metrics, active measurements (CBB, IPAG WIPM delay); Mobility customer tickets (Boston market – PE isolation)]
• 1. At-a-glance view of network topology and state
• Visualization to summarize important information on network health • Color-coded
• Complementary to ticketing system – reporting issues below “alarming” status
http://ptolemy.research.att.com/
Use network visualization and convenient data exploration to help network operators with network health monitoring and service problem troubleshooting
Ptolemy
http://ptolemy.research.att.com/mobility
Assess damage, identify remaining capacity
Loss of many links out of Japan. What’s left?
Example 1: Japan Earthquake, March 11th 2011
Identify traffic shifts, no congestion
Increase in link load as traffic re-routed
[Chart: link load over time]
Example 1: Japan Earthquake, March 11th 2011
Recap • Load data from multiple diverse sources • Transparent access to real-time and historical data
• Schedule queries/updates – And materialized views
• Handle late/out-of-order data • Could have two separate systems, but …
Architectures • DSMS-based • DBMS-based • Hadoop-based
DSMS-based • Add ability to store data (e.g., Aurora/Borealis)
[Diagram: DSMS reading from a connection point backed by a “static” data set, producing an output stream]
DSMS-based
• Example 2: Moirae: history-enhanced monitoring
[Diagram: Borealis DSMS coupled with Postgres; SQL queries over exported data]
DSMS-based • Example 3: Dejavu: pattern matching over live and historical streams – Actually DBMS-based (MySQL)
[Diagram: pattern matching engine coupled with MySQL; pattern match queries over exported data]
DSMS-based • Pros
– Enables real-time processing with context
• Cons – Does not enable complex analytics
• Must keep up with live data
– Stores limited history
DBMS-based • Use the query processing and storage engine of a DBMS
• Add layers for additional services – Fast data load – Temporal partitioning – Update propagation – Scheduling
• Add stream warehouse-specific features and optimizations
DBMS-based
• Design decisions: – Row store (Data Depot/Daytona, Truviso/Postgres) vs. column store (DataCell/MonetDB, SAP HANA, Vertica)
– Disk (Data Depot, Truviso) vs. main memory (DBToaster, SAP HANA)
DBMS-based • Pros:
– Leverage SQL, query optimization, data storage
• Cons: – Not quite real-time
Hadoop / Map-Reduce based • HOP (Hadoop Online Prototype) • Idea: instead of waiting for all mappers to finish, send output incrementally from mappers to reducers – periodically invoke reducers on the available data
Hadoop / Map-Reduce based
• MapUpdate/Muppet (Walmart Labs), similar ideas in: Incoop, SCALLA – Reduce: for each key, process all values and return a single output value
– Update: given a new (k,v) pair, return an updated output value using the new pair and state of k
• And update the state
Hadoop / Map-Reduce based • Nova (Yahoo)
– “Pipelining” between jobs in a workflow (in large batches)
– Pass a “delta” to the next job in a workflow
Hadoop / Map-Reduce based • Pros:
– Leverage scale-out and fault tolerance • Cons:
– Again, not quite real-time
How? • Common elements in a stream warehouse – Temporal partitioning – Update propagation / workflow – Temporal dimension tables – Temporal consistency management
Temporal Partitioning
• The primary partitioning field is the record timestamp • Stream data is mostly sorted • Most new data loads into a new partition
– Avoid rebuilding indices • Simplified data expiration – roll off the oldest partitions (see the sketch below)
[Diagram: a table partitioned by time, each partition with its own index; new data appended as a new partition]
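A minimal sketch of the idea in Python, under invented names (Partition, TimePartitionedTable): records land in the partition covering their timestamp, which for mostly sorted streams is usually the newest one, and expiration simply drops whole old partitions.

```python
import bisect
from dataclasses import dataclass, field

@dataclass
class Partition:
    start: int          # inclusive lower bound of the partition's time range (seconds)
    length: int         # width of the partition's time range
    rows: list = field(default_factory=list)

class TimePartitionedTable:
    """Toy time-partitioned table: appends mostly hit the newest partition,
    and expiration rolls off the oldest partitions without touching the rest."""

    def __init__(self, partition_length=60):
        self.partition_length = partition_length
        self.partitions = []            # kept sorted by start time

    def _partition_for(self, ts):
        start = ts - (ts % self.partition_length)
        starts = [p.start for p in self.partitions]
        i = bisect.bisect_left(starts, start)
        if i < len(self.partitions) and self.partitions[i].start == start:
            return self.partitions[i]
        p = Partition(start, self.partition_length)
        self.partitions.insert(i, p)    # late data may re-open an old time range
        return p

    def append(self, record):
        # record = (timestamp, payload); mostly-sorted input means this usually
        # touches only the newest partition, so older indexes are not rebuilt
        self._partition_for(record[0]).rows.append(record)

    def roll_off(self, horizon_ts):
        # data expiration: drop whole partitions older than the horizon
        self.partitions = [p for p in self.partitions
                           if p.start + p.length > horizon_ts]
```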
Update Propagation / Workflow • Streaming analytics – maintain a system of complex materialized views
• Push new data through base tables to all dependent tables – Create new partitions – Update existing partitions as needed (propagation sketch after the diagram below)
[Diagram: view hierarchy. Base tables (Twitter feeds, active measurements, link utilization, customer complaints) feed service alerts and sentiment analysis, which feed hourly and daily aggregates]
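A sketch of update propagation through such a view hierarchy, in Python; the dependency graph, table names, and the refresh callback are invented for illustration. A newly loaded time range is pushed from a base table to every dependent view, each refreshed once, in dependency order.

```python
from collections import defaultdict, deque

# Hypothetical view DAG: each view lists the tables it is derived from.
DEPENDS_ON = {
    "service_alerts":     ["twitter_feeds", "active_measurements", "link_util"],
    "sentiment_analysis": ["twitter_feeds", "customer_complaints"],
    "hourly_aggregate":   ["service_alerts", "sentiment_analysis"],
    "daily_aggregate":    ["hourly_aggregate"],
}

# Invert the edges: source table -> views that must be refreshed after it.
DEPENDENTS = defaultdict(list)
for view, sources in DEPENDS_ON.items():
    for s in sources:
        DEPENDENTS[s].append(view)

def refresh(view, time_range):
    # Placeholder for the real work: create or update the partitions of
    # `view` that overlap `time_range` (e.g., run its defining query).
    print(f"refreshing {view} for {time_range}")

def propagate(base_table, time_range):
    """Refresh every view that (transitively) depends on base_table,
    each at most once, in topological order."""
    # 1. find all affected views
    affected, frontier = set(), deque([base_table])
    while frontier:
        for v in DEPENDENTS[frontier.popleft()]:
            if v not in affected:
                affected.add(v)
                frontier.append(v)
    # 2. refresh in dependency order: a view runs only after all of its
    #    affected source views have been refreshed
    indeg = {v: sum(s in affected for s in DEPENDS_ON[v]) for v in affected}
    ready = deque(v for v in affected if indeg[v] == 0)
    while ready:
        view = ready.popleft()
        refresh(view, time_range)
        for w in DEPENDENTS[view]:
            if w in affected:
                indeg[w] -= 1
                if indeg[w] == 0:
                    ready.append(w)

propagate("twitter_feeds", ("9:30", "9:45"))
```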
Temporal Dimension Tables • Most streaming data describes events
– Occurs at a point in time, or is a measurement over a well-defined interval
• Some streaming data defines conditions – Properties of an entity that endure for a time interval – Temporal dimension tables – the timestamp is a valid-time interval (example and join sketch below)
• Pervasive use – You can’t evaluate an event without knowing about the environment
– Link speeds, cell tower locations, power grid organization • Snapshot tables don’t work
– Late-arriving data, recomputation, new long-term analyses.
Temporal Dimension Table Example
SNMP_BytesTransferred
Ip_address  Timestamp  Bytes_xfered
4.3.2.1     1:05       1,000,000
4.3.2.1     1:10       1,200,000
4.3.2.1     1:15       2,200,000
LinkSpeed
Ip_address  Tlo    Thi   Speed
4.3.2.1     12:15  1:15  1,000,000 B/min
4.3.2.1     1:15   -     2,000,000 B/min
LinkUtilization
Ip_address  Timestamp  Utilization
4.3.2.1     1:05       .2
4.3.2.1     1:10       .24
4.3.2.1     1:15       .22
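A sketch of the join that produces LinkUtilization above, in Python; the helper names and the fixed 5-minute polling interval are assumptions for illustration. Each SNMP measurement is matched to the LinkSpeed row whose valid-time interval [Tlo, Thi) contains its timestamp.

```python
# Temporal dimension lookup: a fact row joins to the dimension row whose
# valid-time interval [tlo, thi) contains the fact's timestamp.
# Times below are minutes since midnight (12:15 = 735, 1:05 PM = 785, ...).

snmp_bytes = [            # (ip, timestamp, bytes_xfered); 5-minute SNMP polls
    ("4.3.2.1", 785, 1_000_000),
    ("4.3.2.1", 790, 1_200_000),
    ("4.3.2.1", 795, 2_200_000),
]
link_speed = [            # (ip, tlo, thi, speed in bytes/min); None = still valid
    ("4.3.2.1", 735, 795, 1_000_000),
    ("4.3.2.1", 795, None, 2_000_000),
]
POLL_MINUTES = 5          # assumed polling interval

def speed_at(ip, ts):
    """Return the speed whose valid-time interval [tlo, thi) covers ts."""
    for s_ip, tlo, thi, speed in link_speed:
        if s_ip == ip and tlo <= ts and (thi is None or ts < thi):
            return speed
    raise LookupError(f"no LinkSpeed row covers {ip} at {ts}")

# Derive LinkUtilization = bytes transferred / (poll interval * current speed)
link_utilization = [
    (ip, ts, round(b / (POLL_MINUTES * speed_at(ip, ts)), 2))
    for ip, ts, b in snmp_bytes
]
print(link_utilization)   # [(.., 785, 0.2), (.., 790, 0.24), (.., 795, 0.22)]
```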
Temporal Dimension Tables • Updates
– Snapshots of current status, deltas. • Snapshot windows in StreamInsight • Compute from the stream
– Frames – based on a condition of records in a stream
– Interval punctuation
Optimizations
• DBToaster • Multi-Version Concurrency Control • Partition Restructuring • Partition Revisions • Temporal Consistency Management • Scheduling
DBToaster • Maintain complex aggregate views over streaming data.
• In-memory architecture: all storage is via hash tables. – 1TB main memory servers are inexpensive
• Uses a novel recursive-delta technique to accelerate maintenance – Collection of support views that can significantly reduce update time (toy sketch after the diagram below)
[Diagram: support-view hierarchy for maintaining Join(R,S,T): auxiliary views Join(S,T), Join(R,T), Join(R,S) over base relations R, S, T]
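A toy illustration of the support-view idea (not the DBToaster algorithm itself), with invented names: to keep COUNT(*) of R JOIN S on a key up to date, materialize per-key counts of each input so that a single insertion updates the join count in O(1).

```python
from collections import defaultdict

class JoinCountView:
    """Incrementally maintains |R JOIN S on key| using two support views:
    per-key row counts of R and of S. Each insert is O(1) instead of a rescan."""

    def __init__(self):
        self.count_r = defaultdict(int)   # support view: rows of R per key
        self.count_s = defaultdict(int)   # support view: rows of S per key
        self.join_count = 0               # the maintained aggregate view

    def insert_r(self, key):
        # the new R row pairs with every existing S row sharing the key
        self.join_count += self.count_s[key]
        self.count_r[key] += 1

    def insert_s(self, key):
        self.join_count += self.count_r[key]
        self.count_s[key] += 1

v = JoinCountView()
for k in ["a", "a", "b"]:
    v.insert_r(k)
for k in ["a", "b", "b"]:
    v.insert_s(k)
print(v.join_count)   # 4: key 'a' contributes 2*1, key 'b' contributes 1*2
```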
Multi-version Concurrency Control • MVCC allows queries and updates to proceed concurrently – Read isolation – Long analytic queries do not block real-time updates
• Single-updater MVCC is cheap and easy – Use a directory-swap algorithm (sketch below)
• Encourages use of cloud-friendly write-once files.
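A minimal sketch of a single-updater directory swap in Python; the file layout and names are assumptions, not the DataDepot implementation. The updater writes a new version of a partition into a fresh directory of write-once files, then atomically repoints a symlink; readers resolve the symlink and see either the old or the new version, never a partial one.

```python
import os, tempfile

def read_current(table_dir):
    """Readers resolve the 'current' symlink once and read a consistent version."""
    version_dir = os.path.realpath(os.path.join(table_dir, "current"))
    return [os.path.join(version_dir, f) for f in sorted(os.listdir(version_dir))]

def publish_new_version(table_dir, write_files):
    """Single updater: materialize a new version, then swap the symlink atomically."""
    new_dir = tempfile.mkdtemp(prefix="v_", dir=table_dir)   # write-once files
    write_files(new_dir)                                     # caller fills the dir
    tmp_link = os.path.join(table_dir, "current.tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_dir, tmp_link)
    os.replace(tmp_link, os.path.join(table_dir, "current"))  # atomic rename
    # old version directories can be garbage-collected once no reader uses them
```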
Partition Restructuring • As data ages, its best representation changes
– Most recent data: optimize for fast ingest – Stable data: optimize for queries – Historical data: minimize storage cost
• Restructure partitions as the data ages – MVCC allows data maintenance to occur as a non-interfering background task
Partition Size
• New partitions should match the update increment
• Problem: partition explosion – 1-minute partitions, 1440 per day, 525,600 per year
• Merge partitions as they age (sketch after the diagram below)
[Diagram: partitions merged into larger ones as data ages; indexes optional on the oldest partitions]
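A sketch of age-based partition merging in Python, reusing the toy Partition class from the earlier partitioning sketch; the policy (merge into hourly partitions once data is a day old) is an invented example.

```python
def merge_old_partitions(partitions, now, min_age=24 * 3600, merged_length=3600):
    """Coalesce partitions older than `min_age` seconds into `merged_length`-second
    partitions; recent partitions are left small so ingest stays cheap."""
    recent, buckets = [], {}
    for p in sorted(partitions, key=lambda p: p.start):
        if now - p.start < min_age:
            recent.append(p)
            continue
        start = p.start - (p.start % merged_length)
        bucket = buckets.setdefault(start, Partition(start, merged_length))
        bucket.rows.extend(p.rows)          # done as a background task under MVCC,
                                            # so readers keep seeing the old layout
    return sorted(buckets.values(), key=lambda p: p.start) + recent
```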
Data Layout • Write-optimized data
– Row-oriented, lightly indexed, uncompressed • Read-optimized data
– Highly indexed, lightly compressed, column storage if beneficial
• Transform as a background task when the data becomes stable – Combine with partition merging
• Aggressive compression for archival data • Implementations in SAP HANA and Vertica
Partition Revisions
• Some data always arrives late • Problem: need to recompute existing partitions – Disk prefers sequential access – Write-once files: need to recompute the entire partition
• Solution: chain updates to the partition – Value of the partition is the sum of the primary (anchor) contents plus the updates (revisions).
Partition Revisions
• Problem: Don’t change old partitions, but what if data arrives out of order?
• Solution: Overflow chains (Truviso); see the sketch after the example below
[Diagram: a time partition stored as an anchor plus a chain of revisions (Packet_Stream, Packets)]
• Works with “raw” and derived/aggregated data
• E.g., packet counts:
Data Layout
[Figure: packet-count example: raw Packet_Stream rolls up to Packets and per-bucket Packet_counts (1000, 1200, 1150, 1400) over time; a late revision (25) adjusts the affected bucket]
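A sketch of how a query could reconstruct a partition’s value from its anchor plus revision chain, in Python; the layout (per-bucket packet counts, with late data appended as revision deltas rather than rewriting the anchor) is invented for illustration.

```python
# A partition is an immutable anchor plus an append-only chain of revisions.
# Each piece maps a time bucket to a packet-count delta; the anchor holds the
# counts computed when the partition was first loaded.

anchor    = {"9:30": 1000, "9:45": 1200, "10:00": 1150, "10:15": 1400}
revisions = [{"9:45": 25}]          # late-arriving packets for the 9:45 bucket

def read_partition(anchor, revisions):
    """The logical value of the partition = anchor + all revision deltas."""
    counts = dict(anchor)
    for rev in revisions:
        for bucket, delta in rev.items():
            counts[bucket] = counts.get(bucket, 0) + delta
    return counts

def apply_late_batch(revisions, late_record_buckets):
    """Loading late data appends a new revision; the anchor is never rewritten."""
    delta = {}
    for bucket in late_record_buckets:
        delta[bucket] = delta.get(bucket, 0) + 1
    revisions.append(delta)

print(read_partition(anchor, revisions))   # {'9:30': 1000, '9:45': 1225, ...}
```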
Temporal Consistency Management
• Traditional notion of consistency: a snapshot of the system.
• Doesn’t apply in a stream warehouse – Late-arriving data is common – Different data sources have different time lags and different likelihoods of late data
• Instead, label data by its degree of completeness
[Chart: number of windows per package vs. time (seconds)]
Query Stability • How do I know when the data is stable enough to query?
• What is stable enough? – Data will never change – Data won’t change much. – I’ll take whatever is there.
Consistency Levels • Punctuations on partitions that indicate completeness.
• Example (simple) collection of consistency levels – Open: The partition should have some data in it. – Closed: The partition will not change. – Complete: The partition will not change, and all data has been received.
• Closed is a guess – WeaklyClosed, StronglyClosed
• Infer at base tables, propagate inferences to materialized views (sketch below).
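A sketch of inferring and propagating such consistency markers, in Python; the level ordering and the rule that a derived partition is only as consistent as its weakest source partition are simplifying assumptions for illustration.

```python
# Consistency levels ordered from weakest to strongest guarantee.
LEVELS = ["open", "closed", "complete"]
RANK = {lvl: i for i, lvl in enumerate(LEVELS)}

def infer_base_level(partition_end, now, max_lateness, all_feeds_reported):
    """Heuristic inference at a base table: a partition is 'closed' once the
    maximum expected lateness has passed, and 'complete' only if every feed
    that contributes to it has reported in."""
    if now < partition_end + max_lateness:
        return "open"
    return "complete" if all_feeds_reported else "closed"

def derived_level(source_levels):
    """A derived partition can only promise the weakest level among the
    source partitions it was computed from."""
    return min(source_levels, key=lambda lvl: RANK[lvl])

# A daily aggregate built from 24 hourly partitions, one of which is still open:
print(derived_level(["complete"] * 23 + ["open"]))   # -> 'open'
```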
Workflow Scheduling • Need to limit resource use to avoid thrashing.
– Hundreds of tables to update, limited (CPU, memory, cache, network) resources.
– Exclusive resources: non-preemptive scheduling. • Ensure that high-priority jobs can execute
– Real-time scheduling • Measures of lateness:
– Staleness: difference between current time and most recent data.
– Tardiness: the difference between a task deadline and task completion.
Workflow Scheduling • Staleness function: difference between current time and most recent data loaded
• Hierarchies of views with highly varying execution times.
[Diagram: view hierarchy on a timeline (9:30, 9:45, 10:00, 10:15): base tables (Twitter feeds, active measurements, link utilization, customer complaints) are fast and updated frequently; derived views (service alerts, sentiment analysis, hourly and daily aggregates) are slow and updated infrequently]
Bounded Tardiness Scheduling • Bound on the maximum tardiness of any task in a task set.
• If update jobs are scheduled regularly, bounded tardiness => bounded staleness
• Most real-time scheduling algorithms have bounded tardiness – EDF, minimum slack, etc. – There can be differences in the tardiness bounds
• Pick a heuristic that works well – E.g. pick the task that provides the largest marginal reduction in staleness (sketch below).
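A sketch of that heuristic in Python; the task fields and estimates are invented, and normalizing the staleness reduction by execution time is one plausible reading of “marginal”. At each decision point, among the ready update jobs, run the one whose completion buys the most freshness for the work it costs.

```python
from dataclasses import dataclass

@dataclass
class UpdateJob:
    table: str
    loaded_until: float      # timestamp of the freshest data already loaded
    available_until: float   # timestamp of the freshest data waiting to be loaded
    exec_time: float         # estimated running time of the update
    priority: float = 1.0    # optional per-table weight

def staleness_reduction(job):
    """How much this table's staleness would drop if its update ran now."""
    return job.priority * max(0.0, job.available_until - job.loaded_until)

def pick_next(ready_jobs):
    """Greedy heuristic: run the job with the largest marginal staleness
    reduction per unit of execution time (non-preemptive)."""
    return max(ready_jobs, key=lambda j: staleness_reduction(j) / j.exec_time)

jobs = [
    UpdateJob("link_util", loaded_until=100, available_until=160, exec_time=2),
    UpdateJob("daily_aggregate", loaded_until=0, available_until=160, exec_time=50),
]
print(pick_next(jobs).table)   # 'link_util': big staleness drop for little work
```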
Track Scheduling • Complication: Large differences in task execution time – Update a base table with 1 minute of data vs. compute a daily aggregate.
• Tardiness bounds depend on the largest task execution times. – Long tasks block short critical tasks.
• Track Scheduling: – partition tasks by execution time. – Restrict the number of long tasks that can execute concurrently
– Reserve resources for short critical tasks
Transient Overload • Common source of overload: catch-up processing. – A feed breaks for a day, then is restored. – The source schema changes, requiring a pause in processing to change update procedures.
– New tables load a long history • Update Chopping
– Break a (temporally) long update into short segments (sketch below). • Update period adjustment
– Decrease the period of backlogged tables to use up (but not oversubscribe) available resources.
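A sketch of update chopping in Python; the segment length and the refresh callback are assumed for illustration. A day-long backlog is processed as a sequence of short updates, so short critical jobs can be interleaved between segments instead of waiting behind one long task.

```python
def chop_update(table, backlog_start, backlog_end, segment_seconds, refresh):
    """Split a (temporally) long catch-up update into short segments.
    `refresh(table, lo, hi)` loads/recomputes the table for the range [lo, hi)."""
    segments = []
    lo = backlog_start
    while lo < backlog_end:
        hi = min(lo + segment_seconds, backlog_end)
        segments.append((lo, hi))
        lo = hi
    for lo, hi in segments:
        # each segment is a short, separately schedulable task; the scheduler
        # may run other (short, critical) updates between segments
        refresh(table, lo, hi)

chop_update("link_util", backlog_start=0, backlog_end=86_400,
            segment_seconds=3_600,
            refresh=lambda t, lo, hi: print(f"{t}: [{lo}, {hi})"))
```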
Data Stream Quality
• New data quality problems – Systematic errors in machine-generated streams – Correlated glitches – Missing/delayed data
• New semantics – Relational data: keys, FDs, CFDs – Streaming/temporal data: order, arrival frequency (sequential dependencies), conservation laws (see the sketch below)
Open Problems • Hybrid system architectures and cross-system optimizations
• Big and fast analytics as a cloud service • Big/fast data mining • Data stream quality/profiling • Complexity management and administration of a big/fast data management system
Bibliography • Applications
– Smart Grid • http://energy.gov/oe/technology-development/smart-grid
– Semiconductor Manufacturing • http://www.appliedmaterials.com/technologies/library/techedge-prizm
• http://www.extremetech.com/extreme/155588-applied-materials-designs-tools-to-leverage-big-data-and-build-better-chips
• Networking Applications – C. Kalmanek et al.: Darkstar: Using Exploratory Data Mining to Raise the Bar on Network Reliability and Performance. DRCN 2009
– H. Yan, A. Flavel, Z. Ge, A. Gerber, D. Massey, C. Papadopoulos, H. Shah, J. Yates: Argus: End-to-end service anomaly detection and localization from an ISP's point of view. INFOCOM 2012: 2756-2760
Bibliography • DSMS-based systems
– D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, S. B. Zdonik: Aurora: a new model and architecture for data stream management. VLDB J. 12(2): 120-139 (2003)
– M. Balazinska, Y. C. Kwon, N. Kuchta, D. Lee: Moirae: History-Enhanced Monitoring. CIDR 2007: 375-386
– N. Dindar, P. M. Fischer, M. Soner, N. Tatbul: Efficiently correlating complex events over live and archived data streams. DEBS 2011: 243-254
Bibliography
• DBMS-based systems – Truviso: S. Krishnamurthy, M. J. Franklin, J. Davis, D. Farina, P. Golovko, A. Li, N. Thombre: Continuous analytics over discontinuous streams. SIGMOD 2010: 1081-1092
– DataCell: E. Liarou, R. Goncalves, S. Idreos: Exploiting the power of relational databases for efficient stream processing. EDBT 2009: 323-334
– L. Golab, T. Johnson, J. S. Seidel, V. Shkapenyuk: Stream warehousing with DataDepot. SIGMOD Conference 2009: 847-854
Bibliography • Hadoop / Map-Reduce Based Systems
– T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, R. Sears: MapReduce Online. NSDI 2010: 313-328
– W. Lam, L. Liu, S. T. S. Prasad, A. Rajaraman, Z. Vacheri, A. Doan: Muppet: MapReduce-Style Processing of Fast Data. PVLDB 5(12): 1814-1825 (2012)
– C. Olston, G. Chiou, L. Chitnis, F. Liu, Y. Han, M. Larsson, A. Neumann, V. B. N. Rao, V. Sankarasubramanian, S. Seth, C. Tian, T. ZiCornell, X. Wang: Nova: continuous Pig/Hadoop workflows. SIGMOD Conference 2011: 1081-1090
– B. Li, E. Mazur, Y. Diao, A. McGregor, P. J. Shenoy: SCALLA: A Platform for Scalable One-Pass Analytics Using MapReduce. ACM Trans. Database Syst. 37(4): 27 (2012)
– P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, R. Pasquin: Incoop: MapReduce for incremental computations. SoCC 2011: 7
Bibliography
• Late Arriving Data – S. Krishnamurthy et al.: Continuous analytics over discontinuous streams. SIGMOD 2010: 1081-1092
– J. Li, K. Tufte, V. Shkapenyuk, V. Papadimos, T. Johnson, D. Maier: Out-of-order processing: a new architecture for high-performance stream systems. PVLDB 1(1): 274-288 (2008)
– L. Golab, T. Johnson: Consistency in a Stream Warehouse. CIDR 2011: 114-122
Bibliography • Update Propagation / Workflow
– T. Johnson, V. Shkapenyuk: Update Propagation in a Streaming Warehouse. SSDBM 2011: 129-149
– C. Olston et al.: Nova: continuous Pig/Hadoop workflows. SIGMOD Conference 2011: 1081-1090
• Temporal Dimension Tables – M. Li, M. Mani, E. A. Rundensteiner, D. Wang, T. Lin: Interval Event Stream Processing. DEBS 2008
– D. Maier, M. Grossniklaus, S. Moorthy, K. Tufte: Capturing episodes: may the frame be with you. DEBS 2012: 1-11
– Snapshot windows: http://msdn.microsoft.com/en-us/library/ff518550.aspx
Bibliography • Multi-Version Concurrency Control
– D. Quass, J. Widom: On-Line Warehouse View Maintenance. SIGMOD Conference 1997: 393-404
– V. Sikka, F. Färber, W. Lehner, S. K. Cha, T. Peh, C. Bornhövd: Efficient transaction processing in SAP HANA database: the end of a column store myth. SIGMOD Conference 2012: 731-742
• Data Partition Transformations – V. Sikka, F. Färber, W. Lehner, S. K. Cha, T. Peh, C. Bornhövd: Efficient transaction processing in SAP HANA database: the end of a column store myth. SIGMOD Conference 2012: 731-742
– A. Lamb, M. Fuller, R. Varadarajan, N. Tran, B. Vandier, L. Doshi, C. Bear: The Vertica Analytic Database: C-Store 7 Years Later. PVLDB 5(12): 1790-1801 (2012)
Bibliography • DBToaster
– Y. Ahmad, O. Kennedy, C. Koch, M. Nikolic: DBToaster: Higher-Order Delta Processing for Dynamic, Frequently Fresh Views. PVLDB 2012
• Partition Revisions – S. Krishnamurthy, M. J. Franklin, J. Davis, D. Farina, P. Golovko, A. Li, N. Thombre: Continuous analytics over discontinuous streams. SIGMOD 2010: 1081-1092
• Temporal Consistency Management – L. Golab, T. Johnson: Consistency in a Stream Warehouse. CIDR 2011: 114-122
• Bounded Tardiness Scheduling – H. Leontyev, J. H. Anderson: Generalized tardiness bounds for global multiprocessor scheduling. Real-Time Systems 44(1-3): 26-71 (2010)
Bibliography • Stream Warehouse Scheduling
– L. Golab, T. Johnson, V. Shkapenyuk: Scalable Scheduling of Updates in Streaming Data Warehouses. IEEE Trans. Knowl. Data Eng. 24(6): 1092-1105 (2012)
– S. Guirguis, M. A. Sharaf, P. K. Chrysanthis, A. Labrinidis, K. Pruhs: Adaptive Scheduling of Web Transactions. ICDE 2009
Bibliography • Data stream quality
– L. Golab, H. J. Karloff, F. Korn, A. Saha, D. Srivastava: Sequential Dependencies. PVLDB 2(1): 574-585 (2009)
– L. Golab, H. J. Karloff, F. Korn, B. Saha, D. Srivastava: Discovering Conservation Rules. ICDE 2012: 738-749
– T. Dasu, J. M. Loh: Statistical Distortion: Consequences of Data Cleaning. PVLDB 5(11): 1674-1683 (2012)
– L. Golab: Data Warehouse Quality: Summary and Outlook. In: S. Sadiq (ed.), Handbook of Data Quality - Research and Practice, Springer-Verlag Berlin Heidelberg, 2013