A Framework of Mining Trajectories From Untrustworthy Data


A Framework of Mining Trajectories from Untrustworthy Data in Cyber-Physical System

LU-AN TANG, NEC Labs America
XIAO YU, QUANQUAN GU, and JIAWEI HAN, University of Illinois at Urbana-Champaign
GUOFEI JIANG, NEC Labs America
ALICE LEUNG, BBN Technology
THOMAS LA PORTA, Pennsylvania State University

A cyber-physical system (CPS) integrates physical (i.e., sensor) devices with cyber (i.e., informational) components to form a context-sensitive system that responds intelligently to dynamic changes in real-world situations. The CPS has wide applications in scenarios such as environment monitoring, battlefield surveillance, and traffic control. One key research problem of CPS is called mining lines in the sand. With a large number of sensors (sand) deployed in a designated area, the CPS is required to discover all trajectories (lines) of passing intruders in real time. There are two crucial challenges that need to be addressed: (1) the collected sensor data are not trustworthy, and (2) the intruders do not send out any identification information. The system needs to distinguish multiple intruders and track their movements. This study proposes a method called LiSM (Line-in-the-Sand Miner) to discover trajectories from untrustworthy sensor data. LiSM constructs a watching network from sensor data and computes the locations of intruder appearances based on the link information of the network. The system retrieves a cone model from the historical trajectories to track multiple intruders. Finally, the system validates the mining results and updates sensors' reliability scores in a feedback process. In addition, LoRM (Line-on-the-Road Miner) is proposed for trajectory discovery on road networks (mining lines on the roads). LoRM employs a filtering-and-refinement framework to reduce the distance computational overhead on road networks and uses a shortest-path-measure to track intruders. The proposed methods are evaluated with extensive experiments on big datasets. The experimental results show that the proposed methods achieve higher accuracy and efficiency in trajectory mining tasks.

Categories and Subject Descriptors: H.2.8 [Database Applications]: Data Mining

General Terms: Algorithms, Experimentation, Performance

Additional Key Words and Phrases: Cyber-physical system, sensor network, trajectory

Research was sponsored in part by the U.S. Army Research Lab under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA); the Army Research Office under Cooperative Agreement No. W911NF-13-1-0193; National Science Foundation IIS-1017362, IIS-1320617, and IIS-1354329; HDTRA1-10-1-0120; and MIAS, a DHS-IDS Center for Multimodal Information Access and Synthesis at UIUC. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. government. The U.S. government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright notation hereon.

Authors' addresses: L.-A. Tang and G. Jiang, NEC Labs America, 4 Independence Way, Suite 200, Princeton, NJ 08540; emails: {ltang, gfj}@nec-labs.com; X. Yu, Q. Gu, and J. Han, University of Illinois at Urbana-Champaign, 201 N. Goodwin Avenue, Urbana, IL 61801; emails: {xiaoyu1, qgu3, hanj}@illinois.edu; A. Leung, BBN Technology, 10 Moulton Street, Cambridge, MA 02138; email: [email protected]; T. La Porta, Pennsylvania State University, 342 Information Sciences and Tech. Building, University Park, PA 16802; email: [email protected].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].

2015 ACM 1556-4681/2015/02-ART16 $15.00

DOI:http://dx.doi.org/10.1145/2700394

ACM Transactions on Knowledge Discovery from Data, Vol. 9, No. 3, Article 16, Publication date: February 2015.

ACM Reference Format:

Lu-An Tang, Xiao Yu, Quanquan Gu, Jiawei Han, Guofei Jiang, Alice Leung, and Thomas La Porta. 2015. A framework of mining trajectories from untrustworthy data in cyber-physical system. ACM Trans. Knowl. Discov. Data 9, 3, Article 16 (February 2015), 35 pages. DOI: http://dx.doi.org/10.1145/2700394

1. INTRODUCTION

A cyber-physical system (CPS) is an integration of sensor networks with informational devices [National Science Foundation 2008]. The CPS employs a large number of low-cost, densely deployed sensors to watch over designated areas and automatically discover passing intruders. Such a system has many promising applications in both military and civilian fields, including missile defense [Hwang et al. 2004], battlefield awareness [Hewish 2001; Tang et al. 2012a], traffic control [Lo et al. 2008; Zheng and Zhou 2011], neighborhood watch [Li et al. 2011b], environment monitoring [Tolle et al. 2005], and wildlife tracking [Li et al. 2011a]. The key problem in the preceding applications is called mining lines in the sand [Arora et al. 2004], that is, discovering trajectories of passing intruders from the collected sensor data.

Figure 1 shows the framework of a battlefield CPS: the sand (seismic, acoustic, and magnetic sensors) is deployed in a designated area. It constantly collects signals of vibration, sound, and magnetic force from the environment. When an intruder passes by, these sensors detect a signal change and send out detection records. The system analyzes the collected data and discovers intruder trajectories in real time. Such a system helps military forces see through the fog of war and protect troops and bases on the battlefield.

However, mining lines in the sand is considered one of the major challenges in the CPS research field, partly due to the following problems:

Untrustworthy data: Many deployment experiences have shown that untrustworthy (i.e., faulty) data are the most serious problem that impacts CPS performance [Szewczyk et al. 2004; Tolle et al. 2005]. Untrustworthy data are generated for various reasons, including hardware failure, communication limits, and environmental influences. It is difficult to filter them out solely based on signal values, because the values of faulty signals are similar to the correct ones.

Tracking intruders: There are usually multiple intruders in the monitoring area, and the system is required to track all of them. Since the intruders do not send out any identification information, the system has to distinguish them and track their movements.

Massive data: A CPS usually contains hundreds or even thousands of sensors [National Science Foundation 2008]. Each sensor generates a data record every few minutes; such records form a big dataset. In several applications, actions must be taken immediately to deal with the intruders. The system is required to discover trajectories in real time.

In this study, we propose a framework called LiSM (Line-in-the-Sand Miner) to discover intruder trajectories from untrustworthy sensor data. LiSM first constructs a watching network to model the relationship between sensors and data records. Then LiSM detects the intruders' appearances based on the link information of the watching network. To track multiple intruders, a cone model is proposed to generate intruder trajectories. The system employs a validation process to filter out false positives and updates sensors' reliability scores. The technical contributions of this study are summarized as follows.

Constructing a watching network to model the relationship among sensors, records, and intruders. Such a network helps detect the intruder appearances at every timestamp.


Fig. 1. The framework of battlefield CPS.

Proposing a cone model to track multiple intruders. The cone model is constructed from historical trajectories and indicates the possible regions of an intruder's next appearance. The system matches newly detected appearances with the cone model to generate intruder trajectories.

Validating the candidate trajectories. The system filters out false positives and updates sensors' reliability scores in a feedback process.

Extending the proposed framework to support trajectory discovery on the road network.

Conducting extensive experiments to evaluate the effectiveness and efficiency of the proposed methods on big datasets. The experiment results show that our approach yields higher precision and recall than existing methods.

This article substantially extends the version presented at the ACM SIGKDD 2013 conference [Tang et al. 2013] by (1) introducing the task of mining lines on the roads to model a new trajectory discovery problem on the road network; (2) analyzing the problems that arise when applying LiSM to new scenarios and proposing a new method called LoRM (Line-on-the-Road Miner); (3) designing a filtering-and-refinement framework to improve algorithm efficiency; (4) proposing the shortest-path-measure to help track intruders on the road network; (5) carrying out the time complexity analysis for the proposed algorithms; (6) providing complete formal proofs for properties and propositions; (7) covering the related studies in more detail and including recent ones; (8) expanding our performance studies on road network datasets; and (9) discussing the important issues for mining lines in the sand. The experimental results show that LoRM takes only 15% to 20% of the time of LiSM and achieves higher accuracy in mining tasks.

The rest of the article is organized as follows. Section 2 introduces the background knowledge and problem formulation. Section 3 proposes the techniques of intruder detection. Section 4 introduces the intruder tracking methods. Section 5 introduces LoRM for trajectory discovery on the road network. Section 6 evaluates the algorithms' performance. Section 7 discusses some important issues of the problem. Section 8 gives a survey of related work. Finally, Section 9 concludes the article.





Fig. 2. Example: the detection records.

2. PROBLEM STATEMENT

Recent advances in sensor technology have produced many types of sensors for area-monitoring purposes. Such sensors can be roughly classified into two categories: (1) active sensors (e.g., infrared sensors and radar sensors), which radiate signal pulses and detect objects by the echo bouncing off the intruders, and (2) passive sensors (e.g., acoustic sensors, seismic sensors, and magnetic sensors), which only receive signals from the environment. Active sensors achieve higher accuracy but require significantly more power to operate and drain batteries quickly. Furthermore, when active sensors radiate signal pulses, they are at high risk of being detected by the intruders. As a result, the CPS is usually deployed with a large number of low-cost, energy-saving passive sensors.

Passive sensors constantly collect signals of sound, vibration, and magnetic forces from the environment. When an intruder passes by, the sensors detect it based on the signal changes. However, due to hardware limitations, the sensors can only report the possible area of an intruder's appearance rather than a point location. In this study, we model the reported area as a planar region bounded by a circle.

Definition 1 (Detection Record). Let $s_i$ be a sensor and $t_j$ be a timestamp. The detection record $r_{i,j}$ is a two-tuple, $r_{i,j} = \{cen(r_{i,j}), rad(r_{i,j})\}$, where $cen(r_{i,j})$ and $rad(r_{i,j})$ are the center and radius of a round area indicating the possible position of an intruder's appearance at $t_j$.

Example 1. Figure 2 shows a list of detection records at time $t_1$. The solid triangle node is the intruder $o_1$. The round nodes are nearby sensors. The solid round nodes (red) are the responding sensors that send out detection records, such as $s_1$, $s_2$, and $s_7$. The centers of the estimated regions are tagged as hollow triangles. Sensor $s_6$ is a nonresponding sensor that does not generate any detection record. It is tagged as a shadowed round node (blue).

Example 1 reveals three major problems of passive sensors. First, even if the intruder is detected by multiple sensors, each sensor reports an intruder's appearance with a margin of error. The detection records should be aggregated for a more accurate result. Second, some false-positive records are generated, such as $r_{2,1}$ and $r_{5,1}$. The system must filter them out. Third, the sensor $s_6$ should send out a detection record but fails to do so. It is a false negative.

False-positive and false-negative records have various causes, such as wind and animal movements. Sensor reliability is a critical factor that impacts the quality of detection results. We introduce two measurements of a sensor's reliability, as defined next.

Definition 2 (Valid Detection). Let $q_{k,j}$ be the position of intruder $o_k$ at time $t_j$. A record $r_{i,j}$ is called a valid detection if there exists an intruder $o_k$ such that $dist(cen(r_{i,j}), q_{k,j}) \le rad(r_{i,j})$.

Definition 3 (Robustness). Let $s$ be a sensor. The robustness $\alpha(s)$ is defined as the proportion of valid detections among all the records generated by $s$.

Definition 4 (Sensitivity). Let $s$ be a sensor. The sensitivity $\beta(s)$ is defined as the probability that $s$ sends out a valid detection record when an intruder passes through $s$'s watching area.

The robustness denotes the sensor's detection precision, and the sensitivity denotes the sensor's recall. Knowledge of the sensors' robustness and sensitivity is important for filtering out false data. However, the two scores may change over time. In the beginning, the sensor's robustness and sensitivity are both high. As time elapses, sensors may be damaged by the harsh environment or run out of battery power. Therefore, both scores will drop, and they should be dynamically updated based on the detection results.
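As an illustration of Definitions 2 through 4, the following Python sketch scores a single sensor against known intruder positions. It assumes that ground truth is available (for instance, in a simulation), which is not the case in deployment, where the scores have to be updated from the mined trajectories. The function names, the record layout, and the sample numbers are hypothetical and are not taken from LiSM.

```python
import math

def is_valid_detection(center, radius, intruder_positions):
    # Definition 2: the record is valid if some intruder lies inside its circle.
    return any(math.dist(center, q) <= radius for q in intruder_positions)

def robustness(records, truth):
    # Definition 3: share of the sensor's records that are valid detections.
    # records: list of (timestamp, center, radius); truth: {timestamp: [positions]}.
    if not records:
        raise ValueError("the sensor has not generated any record")
    valid = sum(is_valid_detection(c, r, truth.get(t, [])) for t, c, r in records)
    return valid / len(records)

def sensitivity(records, truth, sensor_pos, sensing_range):
    # Definition 4, estimated empirically: among the timestamps at which an
    # intruder is within the sensing range, the share with a valid record.
    passes = [t for t, qs in truth.items()
              if any(math.dist(sensor_pos, q) <= sensing_range for q in qs)]
    if not passes:
        raise ValueError("no intruder passed through the watching area")
    by_time = {t: (c, r) for t, c, r in records}
    hits = sum(1 for t in passes
               if t in by_time and is_valid_detection(*by_time[t], truth[t]))
    return hits / len(passes)

# Hypothetical data: one sensor at the origin with sensing range 5.
truth = {1: [(1.0, 1.0)], 2: [(2.0, 0.0)], 3: [(50.0, 50.0)], 4: [(3.0, 0.0)]}
records = [(1, (1.5, 1.0), 1.0),    # valid detection
           (2, (10.0, 10.0), 1.0),  # wrong location
           (3, (0.0, 1.0), 1.0)]    # false positive, intruder is far away
alpha = robustness(records, truth)                   # 1 valid record out of 3
beta = sensitivity(records, truth, (0.0, 0.0), 5.0)  # 1 hit out of 3 passes
```

In this toy data the sensor misses the pass at timestamp 4 entirely, which lowers the sensitivity estimate but leaves the robustness estimate unchanged.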

The intruder is an object entering the watching area. The system discovers the intruder's movement as an intruder trajectory, which is a sequence of intruder appearances at different timestamps.

Definition 5 (Intruder Trajectory). Let $o_k$ be an intruder and $t_j$ be a timestamp. The intruder appearance $p_{k,j} = \{x_k, t_j\}$ indicates $o_k$'s spatial position $x_k$ at $t_j$. The intruder trajectory is defined as $L_k = \{p_{k,1}, p_{k,2}, \ldots, p_{k,n}\}$.

Since users are only interested in trajectories that are long enough, they may set a threshold on the trajectory size. In addition, the sensor data arrive continuously in a data stream format. The system cannot wait to output the results until after scanning the whole dataset. Users require intruder trajectories to be discovered in real time.

The main theme of this study is data mining, and we assume that the sensors have already been deployed and synchronized. The detection records are collected and transmitted to a data center. There are many state-of-the-art works on sensor deployment and synchronization, gateway design, and message transmission [Sivrikaya and Yener 2004; Cevher and Kaplan 2007]. Now the task boils down to finding the intruders' trajectories from sensor data.

Problem Statement (Mining Lines in the Sand). Let $S$ be the set of sensors and $R$ be the sensor data arriving over time, $R = \{R_1, R_2, \ldots, R_j, \ldots\}$, where $R_j = \{r_{1,j}, r_{2,j}, \ldots, r_{m,j}\}$. The sensors' locations are fixed, and their robustness and sensitivity scores are initialized. Given a length threshold $\delta$, the task of mining lines in the sand is to discover the set of intruder trajectories $L = \{L_1, L_2, \ldots, L_k\}$ in real time, where $size(L_k) \ge \delta$.

Note that the total number of intruders is not known in advance. LiSM is required to discover trajectories of all intruders entering the watching area. Due to the large number of false-positive and false-negative records, if an intruder has a long trajectory across the monitoring region, it is very hard for the system to detect all of its appearances and track them together as the original trajectory. Instead, the system should report some subtrajectories, which can be composed to recover the majority of the original trajectory. The goal of the system is to provide trustworthy trajectories with enough length to help the user understand the movement of intruders and make a decision.

The system framework is illustrated in Figure 3. LiSM is composed of three modules: the intruder appearance miner, the trajectory generator, and the trajectory validator. The appearance miner constructs a watching network from the sensor data and detects the intruder appearances in each snapshot. The trajectory generator composes the detected appearances into intruder trajectories. A cone model is built from the historical trajectory to predict the intruder's next possible movement. The detected intruder appearances are matched with the prediction, and the best-matched one is added to the corresponding trajectory. The trajectory validator calculates the trustworthiness of each candidate trajectory, selects the ones with high trustworthiness as mining results, and removes the candidates with low trustworthiness. Finally, the system updates sensor reliability scores based on the trajectory trustworthiness.

Fig. 3. The system framework of LiSM.

Fig. 4. List of notations.

We will introduce the detailed techniques of LiSM in the following sections. Figure 4 lists the notations used throughout this article.





Fig. 5. Example: the watching network.

3. THE WATCHING NETWORK

In Example 1, $s_1$, $s_3$, and $s_4$ all detect the appearance of intruder $o_1$. However, another nearby sensor, $s_6$, should detect the intruder but does not generate any record. Such a nonresponding sensor disagrees with its responding neighbors. Therefore, the first task of LiSM is to retrieve the hidden relationships among these sensors and intruders.

Definition 6 (Watching Sensors). Let $S$ be the sensor set and $r_{i,j}$ be a detection record. The watching sensor set $S(r_{i,j})$ is defined as $S(r_{i,j}) = \{s \mid s \in S, dist(s, cen(r_{i,j})) \le range(s) + rad(r_{i,j})\}$, where $dist(s, cen(r_{i,j}))$ is the distance between sensor $s$ and the center of $r_{i,j}$, and $range(s)$ is the sensor's maximum sensing range.

Based on the detection records, the watching sensor set is partitioned into two parts: responding sensors and nonresponding sensors.

Definition 7 (Responding and Nonresponding Sensors). Let $r_{i,j}$ be a detection record and $S(r_{i,j})$ be the watching sensor set of $r_{i,j}$. The responding sensor set $S_r(r_{i,j})$ is defined as $S_r(r_{i,j}) = \{s_k \mid s_k \in S(r_{i,j}), \exists\, r_{k,j} \text{ such that } dist(cen(r_{k,j}), cen(r_{i,j})) \le rad(r_{k,j}) + rad(r_{i,j})\}$. The nonresponding sensor set is $S_n(r_{i,j}) = S(r_{i,j}) - S_r(r_{i,j})$.

For a sensor $s$, if $dist(s, cen(r_{i,j})) > range(s) + rad(r_{i,j})$, then $s$ cannot detect the intruder. If $s$ is located in the band where $range(s) - rad(r_{i,j}) \le dist(s, cen(r_{i,j})) \le range(s) + rad(r_{i,j})$, $s$ will not generate any record if the position of the intruder is outside $range(s)$. The watching sensor set includes all sensors within $range(s) + rad(r_{i,j})$ of the record center. Therefore, the set of nonresponding sensors is actually a superset of the false-negative ones. We will refine them later, after computing the intruder position.

With Definitions 6 and 7, we can construct a watching network. This network contains nodes representing sensors and records. Two types of links are constructed in the network: positive links connect the records to responding sensors, and negative links connect the records to nonresponding sensors.
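The link construction in Definitions 6 and 7 can be sketched in Python as follows. This is a minimal illustration under assumed data layouts (a dictionary of sensors and a list of records for one timestamp); the function names, identifiers, and coordinates are hypothetical and not part of the original system.

```python
import math

def watching_sensors(record, sensors):
    # Definition 6: sensors whose distance to the record center is at most
    # range(s) + rad(r). record: (sensor_id, center, radius);
    # sensors: {sensor_id: (position, sensing_range)}.
    _, cen, rad = record
    return {sid for sid, (pos, rng) in sensors.items()
            if math.dist(pos, cen) <= rng + rad}

def split_links(record, sensors, records):
    # Definition 7: a watching sensor is responding (positive link) if it sent
    # a record at this timestamp whose circle overlaps the given record's
    # circle; every other watching sensor gets a negative link.
    _, cen, rad = record
    by_sensor = {sid: (c, r) for sid, c, r in records}
    watching = watching_sensors(record, sensors)
    positive = set()
    for sid in watching:
        if sid in by_sensor:
            c, r = by_sensor[sid]
            if math.dist(c, cen) <= r + rad:
                positive.add(sid)
    return positive, watching - positive

# Hypothetical layout: all sensors share the same sensing range of 5.
sensors = {"s1": ((0.0, 0.0), 5.0), "s2": ((4.0, 0.0), 5.0),
           "s6": ((2.0, 3.0), 5.0), "s9": ((100.0, 100.0), 5.0)}
records = [("s1", (1.0, 1.0), 1.0), ("s2", (20.0, 20.0), 1.0)]
positive, negative = split_links(records[0], sensors, records)
```

Here `s2` does send a record, but its circle is far from the record of `s1`, so the link between them is negative; `s6` sends nothing and is also linked negatively, while `s9` is too far away to be a watching sensor at all.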

Example 2. Figure 5 shows a watching network constructed from the records in Example 1. For the sake of simplicity, we assume that all sensors have the same sensing range in this example. The system draws a circle for each record $r_{i,j}$. The circle's center is at $cen(r_{i,j})$, and the radius is $range(s) + rad(r_{i,j})$. The watching sensors $S(r_{i,j})$ are located inside this circle (e.g., Figure 5 shows the circle of $r_{4,1}$). The system then connects records with positive links (solid lines) to responding sensors and generates negative links (dashed lines) between records and nonresponding sensors. Since sensor $s_6$ does not send any record, it has negative links to all related records. Note that even though $s_2$ is a watching sensor of $r_{4,1}$ and $s_2$ sends out a detection record $r_{2,1}$, the distance between $cen(r_{4,1})$ and $cen(r_{2,1})$ is larger than $rad(r_{4,1}) + rad(r_{2,1})$; thus, the link between $s_2$ and $r_{4,1}$ is a negative link.

In the sensor data, many detection records are caused by the same intruder. For example, $r_{1,1}$, $r_{3,1}$, and $r_{4,1}$ are caused by intruder $o_1$. Such records are called homologous records.

Definition 8 (Homologous Record Set). Let $q_{k,j}$ be the position of intruder $o_k$ at time $t_j$ and $R_j$ be the detection record set at $t_j$. The homologous record set of $q_{k,j}$ is defined as $H_{k,j} = \{r_{i,j} \mid r_{i,j} \in R_j, dist(cen(r_{i,j}), q_{k,j}) \le rad(r_{i,j})\}$.

If the intruder's position, $q_{k,j}$, were known in advance, the system could easily find the homologous records. However, the intruder's position is exactly what the mining task is required to produce. The system has to approximate the homologous records based on the following property.

PROPERTY 1. Let $H_{k,j}$ be a homologous record set at $t_j$, $r_{i,j}, r_{l,j} \in H_{k,j}$ be two records, and $s_i$, $s_l$ be the sensors that send out those records. Then, $s_i$ is a responding sensor of $r_{l,j}$ and $s_l$ is a responding sensor of $r_{i,j}$.

PROOF. Let $q_{k,j}$ be the position of the corresponding intruder in $H_{k,j}$. According to Definition 8, $dist(cen(r_{i,j}), q_{k,j}) \le rad(r_{i,j})$ and $dist(cen(r_{l,j}), q_{k,j}) \le rad(r_{l,j})$.

By the triangle inequality, $dist(cen(r_{i,j}), cen(r_{l,j})) \le dist(cen(r_{i,j}), q_{k,j}) + dist(cen(r_{l,j}), q_{k,j}) \le rad(r_{i,j}) + rad(r_{l,j})$.

By Definition 7, $s_i$ is a responding sensor of $r_{l,j}$ and $s_l$ is a responding sensor of $r_{i,j}$.

The homologous record sets can be approximated by scanning the watching network. The system first picks a record as the seed to initialize a homologous record set. Then the system repeats the following steps: (1) randomly select a record in the set and retrieve all of its responding sensors by following the positive links; (2) check each responding sensor, and if it is also a responding sensor for the other records in the set, add its record to the set. The iteration ends when all records in the homologous set have been processed. In this way, we can guarantee that there is an intersected area among all member records of the approximated homologous record set.
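The seed-and-grow procedure described above can be sketched as follows. This is a simplified illustration, not the authors' Algorithm 1: it works directly on record circles instead of a stored link structure, it picks seeds in input order rather than at random so that the result is reproducible, and it enforces pairwise overlap between a candidate and every current member. All names and coordinates are hypothetical.

```python
import math

def overlaps(a, b):
    # Two records are mutually "responding" (Definition 7) when the distance
    # between their centers is at most the sum of their radii.
    (ca, ra), (cb, rb) = a, b
    return math.dist(ca, cb) <= ra + rb

def homologous_sets(records):
    # records: {sensor_id: (center, radius)} for one timestamp.
    # Returns a list of sets of sensor ids, one per approximated homologous set.
    assigned = set()
    groups = []
    for seed in records:
        if seed in assigned:
            continue
        group = [seed]
        assigned.add(seed)
        frontier = [seed]          # members whose positive links are unvisited
        while frontier:
            current = frontier.pop()
            for cand in records:
                if cand in assigned:
                    continue
                # Follow a positive link from the current member, then require
                # the candidate to overlap every member already in the set.
                if overlaps(records[current], records[cand]) and all(
                        overlaps(records[m], records[cand]) for m in group):
                    group.append(cand)
                    assigned.add(cand)
                    frontier.append(cand)
        groups.append(set(group))
    return groups

# Hypothetical records: three clustered around one intruder, two isolated.
records = {"s1": ((0.0, 0.0), 1.0), "s3": ((1.0, 0.0), 1.0),
           "s4": ((0.5, 0.5), 1.0), "s2": ((10.0, 10.0), 1.0),
           "s5": ((-8.0, 3.0), 0.5)}
groups = homologous_sets(records)
```

The isolated records end up in singleton sets; whether such singletons are later kept or discarded is a matter for the trustworthiness computation rather than for this grouping step.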

Once a homologous record set is generated, we estimate the position of an intruder appearance with Equation (1), where $\omega_{i,j}$ is a normalized weight based on the radius of $r_{i,j}$. Records with lower uncertainty (i.e., smaller radius) have higher weights in determining the position of the intruder appearance. Note that we adopt a linear model to compute $\omega_{i,j}$ for general cases; the weight computation can be modified based on specific signal decay models of the sensors:

$$p_{k,j} = \sum_{r_{i,j} \in H_{k,j}} \omega_{i,j}\, cen(r_{i,j}), \qquad \omega_{i,j} = 1 - \frac{rad(r_{i,j})}{\sum_{r_{l,j} \in H_{k,j}} rad(r_{l,j})}. \tag{1}$$
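A small numeric sketch of the radius-weighted estimate may help. The code below uses the linear weight $1 - rad / \sum rad$ and then divides by the sum of the weights so that they add up to 1 for any set size; that explicit renormalization, the single-record fallback, and the function name are assumptions made for this illustration rather than details stated in the text.

```python
def estimate_position(records):
    # records: list of ((x, y), radius) for one homologous record set,
    # with strictly positive radii.
    if not records:
        raise ValueError("empty homologous record set")
    if len(records) == 1:
        # Assumed fallback: a lone record is represented by its own center.
        return records[0][0]
    total_rad = sum(rad for _, rad in records)
    # Linear weights: a smaller radius (lower uncertainty) gives a larger weight.
    raw = [1.0 - rad / total_rad for _, rad in records]
    norm = sum(raw)  # equals len(records) - 1 for the linear model
    x = sum(w * cen[0] for w, (cen, _) in zip(raw, records)) / norm
    y = sum(w * cen[1] for w, (cen, _) in zip(raw, records)) / norm
    return (x, y)

# Two records: the tighter one (radius 1) gets weight 0.75, the other 0.25.
p = estimate_position([((0.0, 0.0), 1.0), ((4.0, 0.0), 3.0)])

# Three records with equal radii reduce to the plain centroid.
c = estimate_position([((0.0, 0.0), 1.0), ((3.0, 0.0), 1.0), ((0.0, 3.0), 1.0)])
```

With two records the estimate lands at (1.0, 0.0), three times closer to the more certain record than to the less certain one.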

When $p_{k,j}$ is computed, we can refine the watching sensor sets and filter out the sensors that are located outside the detection range.

Definition 9 (Refined Watching Sensors). Let $S$ be the sensor set and $p_{k,j}$ be an intruder appearance. The refined watching sensor set $S(p_{k,j})$ is defined as $S(p_{k,j}) = \{s \mid s \in S, dist(s, p_{k,j}) < range(s)\}$.



Fig. 6. Example: the watching network with intruder appearances.

Definition 10 (Refined Responding and Nonresponding Sensors). Let $p_{k,j}$ be an intruder appearance and $S(p_{k,j})$ be the refined watching sensor set of $p_{k,j}$. The refined responding sensor set $S_r(p_{k,j})$ is defined as $S_r(p_{k,j}) = \{s_i \mid s_i \in S(p_{k,j}), \exists\, r_{i,j} \text{ such that } dist(p_{k,j}, cen(r_{i,j})) \le rad(r_{i,j})\}$. The refined nonresponding sensor set is $S_n(p_{k,j}) = S(p_{k,j}) - S_r(p_{k,j})$.

The intruder appearances are added as new nodes to the watching network. Similarly, positive and negative links are created between the refined sensors and the appearances, as shown in Figure 6.

With the link information of the watching network, we can estimate the trustworthiness of each intruder appearance based on the sensors' robustness and sensitivity. For an appearance $p_{k,j}$, let $s_i \in S_r(p_{k,j})$ be a responding sensor and $s_j \in S_n(p_{k,j})$ be a nonresponding sensor. If $p_{k,j}$ is a real appearance, then $s_i$ reports a valid detection and $s_j$ is a false negative. The probability of $p_{k,j}$ being a valid detection is calculated as Equation (2), where $\alpha(s_i)$ is the robustness of $s_i$ and $\beta(s_j)$ is the sensitivity of $s_j$:

$$Pr(p_{k,j})^{+} = \prod_{s_i \in S_r(p_{k,j})} \alpha(s_i) \prod_{s_j \in S_n(p_{k,j})} \bigl(1 - \beta(s_j)\bigr). \tag{2}$$

Similarly, the probability of $p_{k,j}$ being a false positive can be written as Equation (3):

$$Pr(p_{k,j})^{-} = \prod_{s_i \in S_r(p_{k,j})} \bigl(1 - \alpha(s_i)\bigr) \prod_{s_j \in S_n(p_{k,j})} \beta(s_j). \tag{3}$$

The trustworthiness of the intruder appearance, $\tau(p_{k,j})$, is then calculated as Equation (4):

$$\tau(p_{k,j}) = \log \frac{Pr(p_{k,j})^{+}}{Pr(p_{k,j})^{-}} = \sum_{s_i \in S_r(p_{k,j})} \log \frac{\alpha(s_i)}{1 - \alpha(s_i)} + \sum_{s_j \in S_n(p_{k,j})} \log \frac{1 - \beta(s_j)}{\beta(s_j)}. \tag{4}$$

The robustness $\alpha(s_i)$ and sensitivity $\beta(s_j)$ are initialized by the user in the beginning. The system will automatically update the two scores based on the results of trajectory discovery. We will discuss the details of score updating in Section 4.
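The log-odds form of the trustworthiness score can be checked numerically with a short sketch. The function below takes the robustness values of the responding sensors and the sensitivity values of the nonresponding sensors; the function name and the sample scores are hypothetical, and all scores are assumed to lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def trustworthiness(robustness_scores, sensitivity_scores):
    # Sum of log-odds: responding sensors contribute log(a / (1 - a)),
    # nonresponding sensors contribute log((1 - b) / b).
    score = sum(math.log(a / (1.0 - a)) for a in robustness_scores)
    score += sum(math.log((1.0 - b) / b) for b in sensitivity_scores)
    return score

# Two reliable responding sensors and one nonresponding sensor.
tau = trustworthiness([0.9, 0.8], [0.7])

# The same value from the product form: probability of a real appearance
# divided by the probability of a false positive.
pr_pos = 0.9 * 0.8 * (1.0 - 0.7)
pr_neg = (1.0 - 0.9) * (1.0 - 0.8) * 0.7
tau_from_products = math.log(pr_pos / pr_neg)

# One responding sensor outvoted by three sensitive nonresponding sensors.
tau_doubtful = trustworthiness([0.9], [0.9, 0.9, 0.9])
```

A positive score means that the responding sensors outweigh the silent ones; in the last case three sensitive sensors stayed silent, so the score is negative and the appearance is more likely a false positive.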



Fig. 7. Algorithm: the intruder appearance detection.

Figure 7 lists the algorithm to detect intruder appearances. The algorithm first scans each detection record and retrieves the responding and nonresponding sensors (lines 1 through 4). Then the system initializes the homologous record set $H_{k,j}$ by randomly picking a seed record from the watching network (lines 6 and 7). For each unvisited record $r_{i,j}$ in $H_{k,j}$, the algorithm retrieves $r_{i,j}$'s responding sensors and checks each of their records $r_{l,j}$. If $r_{l,j}$ does not belong to any existing homologous record set and the distance from $r_{l,j}$ to every other record of $H_{k,j}$ is less than the sum of their radii, $r_{l,j}$ is added to $H_{k,j}$ (lines 8 through 14). Once $H_{k,j}$ is generated, the system calculates the intruder appearance $p_{k,j}$ and adds it to the network (lines 15 through 18).

PROPOSITION 1. Let $m$ be the size of record set $R_j$ and $n$ be the size of sensor set $S$. The time complexity of Algorithm 1 is $O(m^2 n)$.

PROOF. Algorithm 1 includes two steps: constructing the watching network (lines 1 through 4) and detecting intruder appearances (lines 5 through 19).

In the first step, the system needs to scan the sensor set and retrieve the corresponding watching sensors for each record in $R_j$, and the total time cost is $O(mn)$.

In the second step, the algorithm generates the homologous record sets by checking the responding sensors of unvisited records. Let $m_h$ be the average size of homologous record sets and $n_r$ be the average number of responding sensors for one record. The time cost is $O(m_h^2 n_r)$. In the worst case, $m_h = m$ and $n_r = n$, so the time complexity of Algorithm 1 is $O(m^2 n)$.

Note that $m_h$ and $n_r$ are actually much smaller than $m$ and $n$ in real cases. The algorithm's time cost is then close to $O(mn)$.





4. TRAJECTORY GENERATION

The watching network discovers the intruder appearances in each snapshot. It is an effective tool for mining dots in the sand. However, a more critical task is connecting the dots as lines. Since the intruders do not send out any identification information, the system has to distinguish them automatically.

After mining the intruder appearances in the first snapshot, LiSM initializes a set of candidate trajectories. Each candidate trajectory contains a discovered intruder appearance. In the following snapshots, the system continues adding newly detected intruder appearances to the candidate trajectories.

Let $p_{i,j}$ be an intruder appearance at time $t_j$ and $L_k$ be a candidate trajectory. The key problem is to calculate the likelihood of $p_{i,j}$ belonging to $L_k$. This value is determined by two factors: (1) $\tau(p_{i,j})$, the trustworthiness of $p_{i,j}$, and (2) $P(p_{i,j}, L_k)$, the matching probability based on the spatial locations of $p_{i,j}$ and $L_k$:

$$\ell(p_{i,j} \in L_k) = \tau(p_{i,j})\, P(p_{i,j}, L_k). \tag{5}$$

The trustworthiness of $p_{i,j}$ is already calculated by Algorithm 1. To compute the matching probability between $p_{i,j}$ and $L_k$, we propose a cone model. This model stores the intruder's recent moving history and predicts the intruder's next move in a cone area. The detected intruder appearances are projected onto the area to compute the matching probability.

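Definition 11 ($\lambda$-recent Trajectory). Let $L_k$ be the trajectory of intruder $o_k$, $t_j$ be the current timestamp, and $\lambda$ be a positive number, $\lambda \le size(L_k)$. The $\lambda$-recent trajectory $L_k^{\lambda}$ is defined as a subset of $L_k$, $L_k^{\lambda} = \{p_{k,j-\lambda}, p_{k,j-\lambda+1}, \ldots, p_{k,j-1}\}$.

The $\lambda$-recent trajectory contains the $\lambda$ latest appearances of intruder $o_k$ before time $t_j$. It is a short history of the intruder's movement. The system can calculate $o_k$'s recent moving speed and direction based on $L_k^{\lambda}$. The average and deviation of the intruder's speed in the period $[t_{j-\lambda}, t_{j-1}]$ are calculated as Equations (6) and (7), where $(t_{j-1} - t_{j-\lambda})$ is the time length between the two timestamps:

$$\bar{v}_k = \frac{\sum_{i=j-\lambda}^{j-2} dist(p_{k,i}, p_{k,i+1})}{t_{j-1} - t_{j-\lambda}}, \tag{6}$$

$$\sigma(v_k) = \sqrt{\sum_{i=j-\lambda}^{j-2} \frac{dist(p_{k,i}, p_{k,i+1})^2}{(t_{j-1} - t_{j-\lambda})(t_{i+1} - t_i)} - \bar{v}_k^{\,2}}. \tag{7}$$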
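To make the speed statistics concrete, the following sketch computes the time-weighted mean and standard deviation of an intruder's speed from a short list of timestamped positions. The function name and the sample points are hypothetical; the code assumes strictly increasing timestamps and clamps tiny negative variances caused by floating-point rounding.

```python
import math

def speed_stats(points):
    # points: list of (timestamp, (x, y)) sorted by time, i.e., the most
    # recent appearances of one intruder.
    if len(points) < 2:
        raise ValueError("need at least two appearances")
    total_time = points[-1][0] - points[0][0]
    mean = 0.0
    second_moment = 0.0
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        d = math.dist(p0, p1)
        mean += d / total_time
        # d^2 / (T * dt) is the segment speed squared, weighted by dt / T.
        second_moment += d * d / (total_time * (t1 - t0))
    variance = max(second_moment - mean * mean, 0.0)
    return mean, math.sqrt(variance)

# Speeds of 1 and 3 over two unit-length intervals: mean 2, deviation 1.
v_mean, v_std = speed_stats([(0, (0.0, 0.0)), (1, (1.0, 0.0)), (2, (4.0, 0.0))])

# Constant speed 1 over uneven intervals: the deviation vanishes.
c_mean, c_std = speed_stats([(0, (0.0, 0.0)), (2, (2.0, 0.0)), (3, (3.0, 0.0))])
```

Because each segment is weighted by its duration, a long slow segment influences the statistics more than a short fast one.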

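The function $direction(p_{k,i}, p_{k,i+1})$ is applied to measure the angle between $o_k$'s moving direction and the x-axis in time $[t_i, t_{i+1}]$. The mean and deviation of the moving direction are computed as shown in Equations (8) and (9):

$$\bar{\theta}_k = \frac{\sum_{i=j-\lambda}^{j-2} direction(p_{k,i}, p_{k,i+1})}{t_{j-1} - t_{j-\lambda}}, \tag{8}$$

$$\sigma(\theta_k) = \sqrt{\sum_{i=j-\lambda}^{j-2} \frac{direction(p_{k,i}, p_{k,i+1})^2}{(t_{j-1} - t_{j-\lambda})(t_{i+1} - t_i)} - \bar{\theta}_k^{\,2}}. \tag{9}$$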

When intruders pass through the watching area, they are unlikely to change moving speed and direction dramatically. We make the assumption that the values of intruder speed and direction follow a normal distribution and build a cone area to predict $o_k$'s appearance at $t_j$.

Fig. 8. Example: the cone model.

Example 3. Figure 8 shows the cone model for intruder $o_k$. Suppose that $\lambda$ is set to 5; the system retrieves $o_k$'s latest five appearances as $L_k^{\lambda}$ and computes $o_k$'s speed and direction. If those parameters follow a normal distribution, the probability is 99.7% that $o_k$'s speed and direction in the period $[t_{j-1}, t_j]$ are within three standard deviations of the mean values. The system calculates the four boundary points as shown in Figure 8. The area of $o_k$'s next possible appearance is then generated as a partial cone with its apex at $p_{k,j-1}$.

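Let $p_{i,j}$ be an intruder appearance in the cone area and $p_{k,j-1}$ be the latest intruder appearance of $L_k$. If intruder $o_k$ moves from $p_{k,j-1}$ to $p_{i,j}$, then $o_k$'s speed and direction in $[t_{j-1}, t_j]$ are estimated as Equations (10) and (11):

$$v_{k,j} = \frac{dist(p_{k,j-1}, p_{i,j})}{t_j - t_{j-1}}, \tag{10}$$

$$\theta_{k,j} = \frac{direction(p_{k,j-1}, p_{i,j})}{t_j - t_{j-1}}. \tag{11}$$

By comparing $v_{k,j}$ and $\theta_{k,j}$ with the mean values $\bar{v}_k$ and $\bar{\theta}_k$, the system can estimate the matching probability between $p_{i,j}$ and $L_k$ as Equation (12):

$$P(p_{i,j}, L_k) = \frac{1}{\sqrt{2\pi}\,\sigma(v_k)} \exp\!\left(-\frac{(v_{k,j} - \bar{v}_k)^2}{2\sigma(v_k)^2}\right) \cdot \frac{1}{\sqrt{2\pi}\,\sigma(\theta_k)} \exp\!\left(-\frac{(\theta_{k,j} - \bar{\theta}_k)^2}{2\sigma(\theta_k)^2}\right). \tag{12}$$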

Example 4. Suppose that there are three intruder appearances detected at $t_j$, as shown in Figure 8. $p_{1,j}$ and $p_{2,j}$ are located in the cone area, and $p_{3,j}$ is outside the area. Their trustworthiness scores are $\tau(p_{1,j}) = 0.1$, $\tau(p_{2,j}) = 0.8$, and $\tau(p_{3,j}) = 0.9$. Even though $p_{3,j}$ has the highest trustworthiness, it is impossible that this is an appearance of $L_k$. By considering the matching probability and trustworthiness of the remaining two appearances, the system selects $p_{2,j}$ as the intruder's appearance at $t_j$.
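The selection in Example 4 can be reproduced with a short sketch that combines the trustworthiness with a matching probability built from two normal densities, and that discards candidates lying more than three standard deviations from the mean speed or direction. The trustworthiness values come from Example 4; the speeds, angles, model parameters, and function names are invented for illustration, a unit time step is assumed, and angle wrap-around is ignored.

```python
import math

def normal_pdf(x, mean, std):
    # Density of a normal distribution with the given mean and deviation.
    return math.exp(-((x - mean) ** 2) / (2.0 * std ** 2)) / (
        math.sqrt(2.0 * math.pi) * std)

def matching_probability(v, theta, model):
    # Product of the speed density and the direction density.
    v_mean, v_std, th_mean, th_std = model
    return normal_pdf(v, v_mean, v_std) * normal_pdf(theta, th_mean, th_std)

def in_cone(v, theta, model):
    # Three-standard-deviation bounds on both speed and direction.
    v_mean, v_std, th_mean, th_std = model
    return abs(v - v_mean) <= 3 * v_std and abs(theta - th_mean) <= 3 * th_std

def best_match(candidates, model):
    # candidates: list of (name, trustworthiness, speed, direction).
    # Returns the name with the largest trustworthiness-weighted matching
    # probability among candidates inside the cone, or None if none qualify.
    best_name, best_score = None, -math.inf
    for name, trust, v, theta in candidates:
        if not in_cone(v, theta, model):
            continue
        score = trust * matching_probability(v, theta, model)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical cone model: mean speed 2 (deviation 0.5), mean direction 0
# radians (deviation 0.2).
model = (2.0, 0.5, 0.0, 0.2)
candidates = [("p1", 0.1, 2.1, 0.05),   # inside the cone, low trust
              ("p2", 0.8, 1.9, -0.05),  # inside the cone, high trust
              ("p3", 0.9, 6.0, 0.0)]    # outside the cone
winner = best_match(candidates, model)
```

As in Example 4, the candidate outside the cone is never considered even though its trustworthiness is the highest, and the second candidate wins among the remaining two.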

Note that we make the assumption that the values of intruder speed and direction follow a normal distribution in this study. Based on our experiment results, this assumption works well. The cone model can be adapted to other distributions or models of intruder movement.

Fig. 9. Algorithm: the trajectory tracking.

If the trajectory $L_k$ does not contain enough intruder appearances (i.e., $\mathrm{size}(L_k) < \omega$), the system constructs a cone model with default speed $v_0$ and $\sigma(v_0)$. The default parameters can be specified by the user or calculated as the mean over all other intruders' $\omega$-recent trajectories. The system also relaxes the constraint on movement direction (i.e., the intruder may move in any direction). The matching probability is then written as Equation (13):

$$P(p_{i,j}, L_k) = \frac{1}{\sqrt{2\pi}\,\sigma(v_k)} \exp\left(-\frac{(v_{k,j} - \bar{v}_k)^2}{2\sigma(v_k)^2}\right). \tag{13}$$

Figure 9 shows the detailed steps of trajectory tracking. For each candidate trajectory $L_k$, Algorithm 2 first checks the trajectory size. If the size is larger than $\omega$, the system retrieves the $\omega$-recent trajectory $L_k^{\omega}$ and calculates the intruder's speed and direction. If the size of $L_k$ is less than $\omega$, the system uses the default parameters (lines 2 through 5). Then, the algorithm constructs the cone model. For each intruder appearance inside the cone area, the system calculates the matching probability. The one with the highest probability is tagged as matched and added to $L_k$ (lines 6 through 15). Finally, the system initializes new candidate trajectories for the unmatched intruder appearances (lines 16 through 18).
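The greedy structure of this tracking step can be sketched as below. It is a simplified illustration, not Algorithm 2 itself: the `score` callback is a hypothetical stand-in for the cone test plus matching probability, and the sketch does not stop two trajectories from claiming the same appearance.

```python
def track_step(trajectories, appearances, score, omega=5):
    """One snapshot of greedy tracking in the spirit of Algorithm 2.

    trajectories: list of lists of appearances (each list is extended in place).
    appearances: appearances detected in the current snapshot.
    score(recent, cand): hypothetical callback returning a matching score,
        or None when cand falls outside the cone built from `recent`.
    """
    matched = set()
    for traj in trajectories:
        recent = traj[-omega:]  # omega-recent part (whole trajectory if shorter)
        best, best_score = None, 0.0
        for idx, cand in enumerate(appearances):
            s = score(recent, cand)
            if s is not None and s > best_score:
                best, best_score = idx, s
        if best is not None:
            traj.append(appearances[best])
            matched.add(best)
    # Unmatched appearances start new candidate trajectories.
    for idx, cand in enumerate(appearances):
        if idx not in matched:
            trajectories.append([cand])
    return trajectories
```

The double loop over trajectories and appearances mirrors the O(mn) bound stated for the algorithm.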

PROPOSITION 2. Let m be the size of record set $R_j$ and n be the number of candidate trajectories. The time complexity of Algorithm 2 is O(mn).



PROOF. Algorithm 2 contains two steps: (1) retrieving the $\omega$-recent trajectory and (2) computing the cone model.

The time complexity of step 1 is O(n).

For step 2, let $m_i$ be the total number of intruder appearances in time $t_j$ and $m_c$ be the average number of intruder appearances in the cone of a trajectory. Whether or not the $\omega$-recent trajectory can be retrieved, for each trajectory, the system has to check all intruder appearances in the cone area. The time cost is $O(m_c n)$.

In the worst case, for each trajectory, all intruder appearances are located in the cone area, so $m_c = m_i$. In addition, the maximum number of intruder appearances equals the size of record set $R_j$, so $m_i = m$. Hence, the time complexity of step 2 is O(mn).

Since m is a magnitude larger than $\omega$, the time complexity of Algorithm 2 is O(mn).

Note that $m_i$ is the number of intruder appearances detected in each snapshot. It is usually much smaller than m. $m_c$ is the average number of intruder appearances in the cone, and it is even smaller than $m_i$. In practice, the efficiency of Algorithm 2 is determined by n, the number of candidate trajectories. The system may create many new trajectories in each snapshot, and n will eventually become large as time elapses. The system needs to control the number of candidate trajectories to improve efficiency.

On the other hand, Algorithm 2 initializes new trajectories based on unmatched intruder appearances in every snapshot. However, the majority of them are ghost trajectories. The ghost trajectories are generated by the false-positive appearances, such as $p_{2,1}$ and $p_{3,1}$ in Figure 6. As time elapses, real trajectories grow longer as more subsequent appearances are added, but ghost trajectories are unlikely to get more appearances. Hence, we can eventually prune them.

Definition 12 (Trajectory Expectation). Let $L_k$ be a candidate trajectory and $t_j$ be the current timestamp. The trajectory expectation $E_j(L_k)$ is an indicator of the expectation that $L_k$ will be a qualified mining result in time $t_j$. $E_j(L_k)$ is defined as Equation (14), where $t_1$ is the timestamp of the first intruder appearance in $L_k$, and $\delta$ is a decay constant:

$$E_j(L_k) = \sum_{p_{k,i} \in L_k} \tau(p_{k,i}) - \delta\,(t_j - t_1). \tag{14}$$

At the end of every snapshot, the system checks the expectation of each candidate trajectory. If the expectation is less than zero, such a trajectory is unlikely to become a qualified result and should be removed from main memory. Meanwhile, if a trajectory's length is longer than the length threshold, the system will report it to the user.
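The end-of-snapshot check can be sketched as follows, assuming Equation (14) is the summed trustworthiness minus a decay term proportional to elapsed time. The dictionary keys and function names are hypothetical, and treating reported trajectories as a separate output list is a simplification.

```python
def trajectory_expectation(trust_scores, t_first, t_now, decay):
    """Equation (14): summed trustworthiness minus a time-proportional decay."""
    return sum(trust_scores) - decay * (t_now - t_first)


def prune_and_report(candidates, t_now, decay, length_threshold):
    """Split candidates into (kept, reported, ghosts) at the end of a snapshot.

    Each candidate is a dict with hypothetical keys 'trust' (one score per
    appearance) and 't_first' (timestamp of the first appearance).
    """
    kept, reported, ghosts = [], [], []
    for cand in candidates:
        e = trajectory_expectation(cand["trust"], cand["t_first"], t_now, decay)
        if e < 0:
            ghosts.append(cand)      # unlikely to qualify: drop from memory
        elif len(cand["trust"]) > length_threshold:
            reported.append(cand)    # long enough: report to the user
        else:
            kept.append(cand)        # keep tracking
    return kept, reported, ghosts
```

A trajectory with a single low-trust appearance quickly falls below zero as time passes, while a trajectory that keeps collecting trustworthy appearances stays positive.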

In many CPS applications, the sensors may be damaged by the environment or run out of battery power as time elapses; the system should also update the sensors' reliability scores.

Let $L_k$ be a candidate trajectory, $L_k = \{p_{k,1}, p_{k,2}, \ldots, p_{k,n}\}$. If $L_k$ is removed from the candidate set as a ghost trajectory, all intruder appearances of $L_k$ will be tagged as ghost appearances. Let $p_{k,j}$ be such a ghost appearance. For all responding sensors $s_i \in S_r(p_{k,j})$, $s_i$ has reported a false positive, and its robustness should be reduced. The robustness score of $s_i$ is then updated as shown in Equation (15), where $l_i$ is the number of false positives reported by $s_i$, and $n_i$ is the total number of detection records generated by $s_i$:

$$\mathrm{robustness}(s_i) = 1 - \frac{l_i}{n_i}. \tag{15}$$

Meanwhile, if $L_k$ is output as a qualified mining result, all intruder appearances of $L_k$ are considered to be true. Let $p_{k,j}$ be a true appearance; each nonresponding sensor $s_i \in S_n(p_{k,j})$ has made a false-negative error. The sensitivity of $s_i$ is then reduced as shown in Equation (16), where $f_i$ is the number of false negatives by $s_i$, and $m_i$ is the total number of intruders that passed through $s_i$'s watching area. Let $l_i$ be the number of false positives by $s_i$ and $n_i$ be the total number of detection records sent by $s_i$; then $m_i = n_i - l_i + f_i$:

$$\mathrm{sensitivity}(s_i) = 1 - \frac{f_i}{m_i} = 1 - \frac{f_i}{n_i - l_i + f_i}. \tag{16}$$

Fig. 10. Algorithm: mining lines in the sand.
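The two reliability updates of Equations (15) and (16) reduce to simple counter arithmetic. The sketch below uses hypothetical function names, and the fallback value of 1.0 for a sensor with no history is an assumption.

```python
def robustness(false_positives, records):
    """Equation (15): 1 - l_i / n_i."""
    if records == 0:
        return 1.0  # assumption: a sensor with no records keeps full robustness
    return 1.0 - false_positives / records


def sensitivity(false_negatives, false_positives, records):
    """Equation (16): 1 - f_i / (n_i - l_i + f_i)."""
    passed = records - false_positives + false_negatives  # m_i
    if passed == 0:
        return 1.0  # assumption: no intruder has passed this sensor yet
    return 1.0 - false_negatives / passed
```

For example, a sensor with 10 records, 2 of them false positives, and 1 missed intruder has robustness 0.8 and sensitivity 1 - 1/9.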

Algorithm 3 shows the detailed process of LiSM (Figure 10). When new data arrives, the system first calls Algorithm 1 to construct the watching network and detect the intruder appearances (lines 2 through 4), then tracks trajectories with the cone model (line 5). After that, the system checks each candidate's trustworthy expectation (line 6). If the expectation is less than zero, such a trajectory is a ghost trajectory and should be removed. The system retrieves the responding sensors for every appearance of the ghost trajectory and reduces their robustness scores (lines 7 through 12). Meanwhile, if the trajectory's size reaches the length threshold, it will be added to the result set. The system retrieves all nonresponding sensors and reduces their sensitivity (lines 13 through 20).

PROPOSITION 3. Let k be the number of record sets in R, l be the average size of $R_j$, m be the size of the sensor set, and n be the average size of the candidate trajectory set. The time complexity of Algorithm 3 is O(kl(lm + n)).

PROOF. Algorithm 3 has three steps to process the record set $R_j$: the intruder detection (calling Algorithm 1), trajectory tracking (calling Algorithm 2), and the candidate validation step (lines 6 through 20).

According to Propositions 1 and 2, the time costs of Algorithms 1 and 2 are $O(l^2 m)$ and $O(ln)$.

In the validation step, the system needs to check the links for each intruder appearance. In the worst case, the system has to scan all links between the sensors and intruder appearances. Let p be the average number of links between the sensors and records in the watching network $G_j$. The time complexity of this step is O(p). Since $p \le lm$, the time cost of processing record set $R_j$ can be written as O(l(lm + n)), and the total time complexity of Algorithm 3 is O(kl(lm + n)).
5. MINING LINES ON THE ROADS

In the previous sections, we have investigated the problem of mining lines in the sand. LiSM is proposed to discover intruder trajectories in 2D Euclidean space. In many real applications, the user requires detection of intruders moving on the roads. The problem of mining lines on the roads poses some unique challenges. This section proposes LoRM for trajectory mining on the road network.

Problem Statement (Mining Lines on the Roads). Let M be a road network, S be the set of sensors installed on M, and R be the sensor data arriving by time, $R = \{R_1, R_2, \ldots, R_j, \ldots\}$, where $R_j = \{r_{1,j}, r_{2,j}, \ldots, r_{m,j}\}$. Given a length threshold, the task of mining lines on the roads is to discover the set of intruder trajectories on M, $L = \{L_1, L_2, \ldots, L_k\}$, where the size of each $L_k$ reaches the length threshold.

The general framework of LiSM can be used by LoRM. The system first scans record sets, constructs the watching network, and detects intruder appearances. Then, LoRM combines detected appearances as intruder trajectories and updates sensors' reliability scores. However, the detailed steps of both detection and tracking should be changed according to several unique difficulties of the problem scenario:

Constraints of intruder detection: The intruders move on the roads, but the sensors' detection records may be off-road. The system should match sensors' detection records to the road network for meaningful detections.

Network distance computation: On the road network, the time cost to track intruders is much higher, because the system has to search the shortest paths for distance computation. The algorithm efficiency becomes a major issue.

Unpredictable moving direction: The cone model is proposed to help intruder tracking. Such a model assumes that the intruders move on a 2D plane, and hence their moving directions are predictable from the historical data. However, now the intruders are moving on the road network. Their moving directions are bounded along the road segments, and the cone model is no longer feasible to predict the intruders' movements.

5.1. Intruder Detection on the Roads

The sensors are installed on the roads to collect sound or vibration signals from environments. When an intruder passes by, the sensors detect it based on the signal changes. Due to hardware limitations, the sensors' detections may be off-road; the system needs to match the detection records onto the road network.

Definition 13 (On-Road Detection). Let M be a road network, $s_i$ be a sensor, $t_j$ be a timestamp, and detection record $r_{i,j} = \{\mathrm{cen}(r_{i,j}), \mathrm{rad}(r_{i,j})\}$. On-road detection $\mathrm{road}(r_{i,j})$ is defined as a spatial coordinate of M ($\mathrm{road}(r_{i,j}) \in M$) satisfying the following: (1) $\forall p \in M$, $\mathrm{dist}(\mathrm{cen}(r_{i,j}), \mathrm{road}(r_{i,j})) \le \mathrm{dist}(\mathrm{cen}(r_{i,j}), p)$; (2) $\mathrm{dist}(\mathrm{cen}(r_{i,j}), \mathrm{road}(r_{i,j})) \le \mathrm{rad}(r_{i,j})$.





Fig. 11. Example: the on-road detections.

Intuitively, $\mathrm{road}(r_{i,j})$ is the closest match of $\mathrm{cen}(r_{i,j})$ to road network M. Since the area of the intruder's appearance is bounded by $\mathrm{rad}(r_{i,j})$, the system only checks the road segments located inside this area. If there is no road segment, detection record $r_{i,j}$ must be a false positive. The system will directly remove $r_{i,j}$ and reduce the robustness score of sensor $s_i$.
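Definition 13 can be illustrated with a small geometric sketch. It assumes the road network is given as a list of straight segments with 2D endpoints, which is a simplification, and the function names are hypothetical.

```python
import math


def project_to_segment(p, a, b):
    """Closest point to p on segment ab (all points are (x, y) tuples)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0:
        return a  # degenerate segment: both endpoints coincide
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len2
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return (ax + t * dx, ay + t * dy)


def on_road_detection(center, radius, segments):
    """Return road(r) per Definition 13, or None if no road lies within radius."""
    best, best_d = None, float("inf")
    for a, b in segments:
        q = project_to_segment(center, a, b)
        d = math.dist(center, q)
        if d < best_d:
            best, best_d = q, d
    return best if best_d <= radius else None
```

A `None` result corresponds to the false-positive case in the text: no road segment intersects the possible area of the record.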

Example 5. Figure 11(a) shows a set of detection records in time $t_1$. The round nodes are the monitoring sensors. The solid round nodes (red) are the responding sensors that send out detection records, such as $s_1$, $s_3$, and $s_4$. The system matches the records to the nearest roads to get on-road detections (hollow triangles). Since detections $r_{2,1}$ and $r_{4,1}$ cannot be matched to any road segment, they are false positives (blue triangles). Sensor $s_2$ is a nonresponding sensor that does not generate any detection record. It is tagged as a shadowed round node (blue).

After matching detection records to the road network, the system computes responding and nonresponding sensors for each road detection. The watching network is constructed as in Figure 11(b). To estimate the position of intruder appearances, the system generates homologous detection sets from the watching network. The intruder appearance, $p_{k,j}$, is then calculated as a weighted average of on-road detections in the homologous set, as shown in Equation (17). The weight $w_{i,j}$ is determined by the distance between $\mathrm{cen}(r_{i,j})$ and $\mathrm{road}(r_{i,j})$. The records with lower uncertainty (smaller distance) have higher weights:

$$p_{k,j} = \mathrm{road}\left(\sum_{r_{i,j} \in H_{k,j}} w_{i,j}\,\mathrm{road}(r_{i,j})\right), \qquad w_{i,j} = 1 - \frac{\mathrm{dist}(\mathrm{cen}(r_{i,j}), \mathrm{road}(r_{i,j}))}{\sum_{r_{l,j} \in H_{k,j}} \mathrm{dist}(\mathrm{cen}(r_{l,j}), \mathrm{road}(r_{l,j}))}. \tag{17}$$

Note that it is possible that $\sum_{r_{i,j} \in H_{k,j}} w_{i,j}\,\mathrm{road}(r_{i,j})$ lies outside the road network M. In such cases, the system matches the coordinate to the nearest road as the position for $p_{k,j}$.

Figure 12 lists the algorithm to detect intruder appearances on the road network. The algorithm first matches the detection records to nearby roads to get on-road detections. The false positives are removed (lines 1 through 6). Then, the system retrieves responding and nonresponding sensors for each on-road detection (lines 7 through 9). After that, the system generates homologous record sets, following similar steps as Algorithm 1 (lines 11 through 19). Once a homologous set $H_{k,j}$ is generated, the system calculates the intruder appearance $p_{k,j}$ and adds it to the watching network (lines 20 through 23).

Fig. 12. Algorithm: intruder detection on road network.

PROPOSITION 4. Let m be the size of record set $R_j$, n be the size of sensor set S, and l be the average number of road segments in the possible area of a detection record. The time complexity of Algorithm 4 is O(m(n + l)).

PROOF. Algorithm 4 has two steps: generating on-road detections (lines 1 through 9) and detecting intruder appearances (lines 10 through 25).

In the first step, the system needs to match the detection records to the road network and scan the sensor set for responding sensors; the total time cost is O(m(l + mn)).

In the second step, the algorithm generates the homologous record sets by checking the responding sensors of unvisited records. Let $m_h$ be the average size of homologous record sets and $n_r$ be the average number of responding sensors for one record. The time cost is $O(m_h^2 n_r)$. In the worst case, $m_h = m$ and $n_r = n$. Then, Algorithm 4's time complexity is O(m(l + mn)).

5.2. Tracking Trajectories on the Roads

After detecting intruder appearances, LoRM needs to generate trajectories. However, the algorithm efficiency becomes a problem. The bottleneck is computing the road network distance.





Fig. 13. Example: the problem of tracking intruders with the cone model.

Definition 14 (Road Network Distance). Let $p_i$ and $p_j$ be two spatial coordinates of road network M. The road network distance $\mathrm{netd}(p_i, p_j)$ is the length of the shortest path connecting $p_i$ and $p_j$ on M.

Suppose that the system maintains m candidate trajectories in memory and detects l intruder appearances in a new snapshot. The road network contains n edges (i.e., road segments). The system has to compute the network distance between every pair of candidate and appearance. There are lm pairs in total. In the worst case, the system has to search all edges of the network to compute the shortest path. Hence, the total time complexity of intruder tracking is O(lmn). Note that a road network typically contains millions of edges; n is a very large number.
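To make the cost of a single network distance concrete, the sketch below computes it with Dijkstra's algorithm. The text does not state which shortest-path algorithm is used, so this choice is an assumption, as are the adjacency-list representation and the simplification that appearances have already been snapped to graph nodes.

```python
import heapq


def netd(graph, source, target):
    """Shortest-path length between two nodes of a road graph (Dijkstra).

    graph: dict mapping node -> list of (neighbor, edge_length) pairs.
    Returns float('inf') when target is unreachable from source.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")
```

Each call may touch a large part of the graph, which is why avoiding unnecessary calls matters for every candidate/appearance pair.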

Another problem is about the tracking effectiveness, as illustrated in the following example.

Example 6. In Figure 13, the system retrieves a recent trajectory, $L_k = \{p_{k,1}, p_{k,2}\}$, and computes the cone model, $\mathrm{cone}_k$. In time $t_3$, six new intruder appearances are detected as $p_{1,3}, p_{2,3}, \ldots, p_{6,3}$. However, none of them is located inside $\mathrm{cone}_k$. The problem is caused by the cone model. The cone model assumes that intruder $o_k$'s moving direction can be predicted from the historical data. However, $o_k$ moves along the road segments now. The cone model is thus no longer accurate.

To solve the problems, we propose two techniques: (1) a filtering-and-refinement framework to improve the tracking efficiency and (2) a shortest-path-measure for effective tracking.

PROPERTY 2. In road network M, the Euclidean distance between two points is the lower bound of the network distance: $\forall p_i, p_j \in M$, $\mathrm{dist}(p_i, p_j) \le \mathrm{netd}(p_i, p_j)$.

PROOF. In the Euclidean space, the shortest path between two points is a straight line. Since the road network is also in the same Euclidean space, the Euclidean distance is less than or equal to the road network distance.

The algorithm's overhead can be significantly reduced based on Property 2. Let $p_{k,j-1}$ be the last intruder appearance in candidate trajectory $L_k$ and $p_{i,j}$ be a newly detected appearance in time $t_j$. Before computing the network distance between $p_{k,j-1}$ and $p_{i,j}$, the system first calculates the Euclidean distance $\mathrm{dist}(p_{k,j-1}, p_{i,j})$. The Euclidean distance computation only needs the spatial coordinates and involves no cost to access the road network. If $\mathrm{dist}(p_{k,j-1}, p_{i,j})/(t_j - t_{j-1})$ is already larger than the maximum speed of intruder $o_k$, $p_{i,j}$ cannot be the next appearance of $o_k$. The system filters it directly without computing $\mathrm{netd}(p_{k,j-1}, p_{i,j})$.
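The filtering step that follows from Property 2 is a one-line Euclidean test. In this sketch the function name and the tuple representation of positions are assumptions.

```python
import math


def euclidean_filter(last_pos, candidates, dt, v_max):
    """Keep only candidates reachable from last_pos within dt at speed v_max.

    Because the Euclidean distance is a lower bound of the network distance,
    any candidate rejected here would also be rejected by the exact test.
    """
    radius = v_max * dt
    return [c for c in candidates if math.dist(last_pos, c) <= radius]
```

Only the survivors need a shortest-path computation in the refinement step.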





Fig. 14. Example: track intruder with three appearances.

Based on Property 2, LoRM runs a filtering process, as illustrated in Figure 14. The system draws a circle with $p_{k,2}$ as the center and $v_{max}(t_j - t_{j-1})$ as the radius. The appearances located outside the circle (for example, $p_{4,3}$, $p_{5,3}$, and $p_{6,3}$) are all filtered out. Only the inside ones ($p_{1,3}$, $p_{2,3}$, and $p_{3,3}$) can be $o_k$'s next appearance.

So which one of the three is most likely to be the next appearance? After a careful examination, one may find that $p_{1,3}$ is more likely to be the one. Since $o_k$ has already moved from $p_{k,1}$ to $p_{k,2}$, it would have to turn back to visit $p_{3,3}$ or $p_{2,3}$. Such a move is relatively rare in the real world.

In most cases, an intruder moves from the source to destination following the shortest path between the two points. Based on this observation, we propose a shortest-path-measure as defined next.

Definition 15 (Partial Trajectory Length). Let $L_k^{\omega}$ be an $\omega$-recent trajectory, $L_k^{\omega} = \{p_{k,j-\omega}, p_{k,j-\omega+1}, \ldots, p_{k,j-1}\}$, $j-\omega \le l \le j-2$. The partial trajectory length is defined as

$$\mathrm{length}(p_{k,l}, p_{k,j-1}) = \sum_{i=l}^{j-2} \mathrm{netd}(p_{k,i}, p_{k,i+1}).$$

Definition 16 (Shortest-Path-Measure). Let $p_{i,j}$ be an intruder appearance, $L_k^{\omega}$ be an $\omega$-recent trajectory, $j-\omega \le l \le j-2$. The shortest-path-measure between $p_{k,l}$ and $p_{i,j}$ is defined as Equation (18):

$$SP(p_{k,l}, p_{i,j}) = \frac{\mathrm{netd}(p_{k,l}, p_{i,j})}{\mathrm{length}(p_{k,l}, p_{k,j-1}) + \mathrm{netd}(p_{k,j-1}, p_{i,j})}. \tag{18}$$

Intuitively, this measure reflects the ratio of the shortest path and the actual distance between $p_{k,l}$ and $p_{i,j}$. If they are the same, $SP(p_{k,l}, p_{i,j})$ has the maximum value of 1.

Let $p_{i,j}$ be an intruder appearance and $L_k^{\omega}$ be an $\omega$-recent trajectory. The average shortest-path-measure between $L_k^{\omega}$'s appearances and $p_{i,j}$ is calculated as Equation (19):

$$ASP_{L_k^{\omega}, p_{i,j}} = \frac{\sum_{l=j-\omega}^{j-2} SP(p_{k,l}, p_{i,j})}{\omega - 1}. \tag{19}$$
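Equations (18) and (19) can be sketched as a single pass over the recent trajectory. The function name is hypothetical, `netd` is passed in as a callable so that any road-distance routine can be plugged in, and returning 1.0 for a zero denominator is an assumption.

```python
def average_sp(recent, cand, netd):
    """Average shortest-path-measure (Equations (18) and (19)).

    recent: the omega-recent appearances [p_{k,j-omega}, ..., p_{k,j-1}].
    cand: the candidate appearance p_{i,j}.
    netd: callable returning the road network distance between two points.
    """
    if len(recent) < 2:
        raise ValueError("need at least two recent appearances")
    tail = netd(recent[-1], cand)  # netd(p_{k,j-1}, p_{i,j})
    total, partial = 0.0, 0.0
    # Walk backwards so `partial` accumulates length(p_{k,l}, p_{k,j-1}).
    for l in range(len(recent) - 2, -1, -1):
        partial += netd(recent[l], recent[l + 1])
        denom = partial + tail
        sp = netd(recent[l], cand) / denom if denom > 0 else 1.0
        total += sp
    return total / (len(recent) - 1)  # omega - 1 terms
```

On a straight road, a candidate that continues forward yields an average of 1, while a candidate that requires turning back yields a much smaller value.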

The trustworthiness of $p_{i,j}$ being the next appearance of $L_k$, $\tau(p_{i,j} \in L_k)$, is then computed as shown in Equation (20). It has three parts: the trustworthiness of $p_{i,j}$, the matching probability based on moving speed, and the average shortest-path-measure between $L_k$ and $p_{i,j}$:

$$\tau(p_{i,j} \in L_k) = \tau(p_{i,j}) \cdot \frac{1}{\sqrt{2\pi}\,\sigma(v_k)} \exp\left(-\frac{(v_{k,j} - \bar{v}_k)^2}{2\sigma(v_k)^2}\right) \cdot ASP_{L_k, p_{i,j}}. \tag{20}$$

Fig. 15. Algorithm: tracking trajectories on road network.

In Equation (20), we assume that the moving speed of the intruders still follows the normal distribution. However, the moving speed may be influenced by the topology and traffic conditions of the road. We will discuss this issue in Section 7.

Figure 15 lists the detailed steps to track intruders on the road network. For each candidate trajectory, Algorithm 5 first retrieves the $\omega$-recent trajectory $L_k^{\omega}$ (lines 2 through 4). Then, the algorithm filters the intruder appearances using Property 2 (lines 6 through 8). After that, the system carries out the refinement process to compute the shortest-path-measure and the matching trustworthiness. The one with the highest trustworthiness is tagged as matched and added to $L_k$ (lines 9 through 15). Finally, the system initializes new candidate trajectories for the unmatched intruder appearances (lines 16 through 18).
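The refinement score of Equation (20) multiplies three factors. A minimal sketch follows, with hypothetical names and with each surviving candidate assumed to be summarized as a (trust, speed, asp) tuple.

```python
import math


def road_matching_trust(trust, speed, v_mean, v_std, asp):
    """Equation (20): trustworthiness x speed density x average SP measure."""
    density = math.exp(-((speed - v_mean) ** 2) / (2 * v_std ** 2)) / (
        math.sqrt(2 * math.pi) * v_std
    )
    return trust * density * asp


def best_match(candidates, v_mean, v_std):
    """Pick the candidate with the highest Equation (20) score.

    candidates: list of (trust, speed, asp) tuples that survived the filter.
    Returns the index of the best candidate, or None if the list is empty.
    """
    if not candidates:
        return None
    scores = [road_matching_trust(t, s, v_mean, v_std, a) for t, s, a in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

A slightly more trustworthy appearance can still lose to one that lies along the shortest path, because the path measure scales the whole score.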

PROPOSITION 5. Let n be the number of edges in the road network, m be the number of trajectory candidates, and l be the number of intruder appearances. The time complexity of Algorithm 5 is O(lmn).

PROOF. In the worst case, no appearances can be filtered out and the system has to compute the shortest distance between each pair of candidate trajectory and intruder appearance. The total time complexity is still O(lmn).

Note that the worst-case time complexity of Algorithm 5 is the same as that of the original one. In the experiment, however, we find that about 80% of the intruder appearances can be filtered and the algorithm's efficiency is improved dramatically.



Fig. 16. Experiment settings.

6. PERFORMANCE EVALUATION

6.1. Experiment Setup

Datasets: To test the performance of LiSM on big and untrustworthy data, we generated four datasets based on the real military trajectories from the CBMANET project [Krout 2007]. The data generator retrieves 10 to 40 trajectories from CBMANET and simulates sensor monitoring fields along their routes with 200 to 10,000 deployed sensors. If an intruder passes by, the sensor generates a detection record. The data generator randomly selects a portion of sensors as false-positive reporters, which may generate detection records without any local intruder. It also selects a portion of sensors as false-negative sensors; such sensors may not send a detection record when an intruder passes by. The detailed parameters of those datasets are listed in Figure 16.

Baselines: The proposed LiSM algorithm (LM) is compared with two baselines: (1) the Kalman filtering-based method (KF) and (2) the TruAlarm method with a nearest-neighbor tracking strategy (TA). TruAlarm is a method to evaluate the trustworthiness of alarms and detect intruder appearances [Tang et al. 2010].

Environments: The experiments are conducted on a PC with Intel 7500 Dual CPU 2.20GHz and 3.00GB RAM. The operating system is Windows 7 Enterprise. All algorithms are implemented in Java on the Eclipse 3.3.1 platform with JDK 1.5.0.

6.2. Evaluations on Mining Efficiency

In the first experiment, we evaluate the efficiency of different algorithms with default parameters. The system runs LM, KF, and TA on the four datasets and records their time costs. Figure 17(a) shows the results on the four datasets. Note that the y-axis is in logarithmic scale. In general, all three algorithms are efficient enough to process the data. LM achieves the best efficiency in all cases, because the algorithm filters out low-expectation trajectory candidates in each snapshot and tracks the trajectories quickly with the cone model.

We then study the factors that influence LM's efficiency. We set the decay factor $\delta$ from 0.05 to 0.2 and record the algorithm's time cost on datasets D1 to D4 in Figure 17(b). With larger $\delta$, the system prunes more candidate trajectories and achieves better time efficiency. We also study the algorithm's running time with respect to the trajectory size threshold and the recent trajectory length $\omega$. Neither parameter influences the algorithm's efficiency, so we omit the results here.

Fig. 17. Efficiency: time costs on different datasets (a) and influence of $\delta$ (b).

6.3. Evaluations on Mining Effectiveness

To evaluate the quality of mining results, we retrieve the intruders' true trajectories as ground truth and compare them against the mining results. There are two stages of mining lines in the sand: (1) detecting the intruder appearances and (2) tracking their trajectories. In this experiment, we first compare the detected intruder appearances with the ground truth. If their distance is less than a reasonable error bound (20m), the detection is considered a valid result. Then we check each generated trajectory $L_k$; since the user only prefers high-quality results to help in their decision making, we consider $L_k$ a valid trajectory only if more than X% of $L_k$'s intruder appearances can be matched to a real trajectory in the ground truth (the default X% is 90%).

We compute two measurements to evaluate the algorithm's effectiveness:

Precision: The proportion of valid appearances/trajectories over the mining results. This represents the algorithm's selectivity for filtering out false positives.

Recall: The proportion of valid appearances/trajectories over the ground truth. This criterion shows the algorithm's sensitivity for detecting the intruders.

Note that due to the large number of false-positive and false-negative records, if an intruder has a long trajectory across the monitoring region, the system usually cannot discover the original one. Instead, LiSM outputs several subtrajectories with length larger than the threshold. The tracking recall is then calculated as the proportion of the number of intruder appearances in valid trajectories over the total number of intruder appearances in the trajectories of ground truth.
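The two measurements and the validity rule for trajectories reduce to simple ratios. The sketch below uses hypothetical function names and assumes the counts have already been obtained by matching against the ground truth.

```python
def precision_recall(num_valid, num_mined, num_truth):
    """Precision = valid / mined results; recall = valid / ground truth."""
    precision = num_valid / num_mined if num_mined else 0.0
    recall = num_valid / num_truth if num_truth else 0.0
    return precision, recall


def is_valid_trajectory(matched_flags, min_fraction=0.9):
    """A mined trajectory is valid if more than min_fraction of its
    appearances match a real trajectory (default X% = 90%)."""
    return sum(matched_flags) / len(matched_flags) > min_fraction
```

For tracking recall, `num_valid` would count the intruder appearances in valid trajectories and `num_truth` the appearances in the ground-truth trajectories, as described above.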

The detection precision and recall of LM, KF, and TA are shown in Figure 18. All three methods achieve a relatively high recall of about 80%. However, the precision of KF and TA drops rapidly in D3 and D4, which have more false detection records. The precision of KF is less than 20% in D4, which is only one fourth of LM's precision. TA's precision is also lower than 50%. In contrast, LM filters out the false-positive data and keeps the precision higher than 80%.

In the next experiment, we investigate LM's precision and recall with different trajectory length thresholds. The results are shown in Figures 19 and 20. With a larger threshold, fewer trajectories are reported. Hence, the algorithm's precision increases, but the recall drops. We also notice that the recall of LM on D2 is lower than on the other three datasets. The reason lies in the dataset: even though D2 is smaller than D3 and D4, some trajectories in D2 are hard to track, and the recall is lower. We will discuss such cases in Section 7. Based on the experiment results, our suggestion is to select a moderate threshold (e.g., 8 to 10) to make LM achieve the best performance.

Finally, we study the influences of parameters $\omega$ and $\delta$. The results of LM's effectiveness are recorded in Figures 21 to 24. If the length of the $\omega$-recent trajectory is too short, LM may not be able to track the intruder with an accurate cone model. Therefore, $\omega$ should be set to a reasonably large value (e.g., 6 to 9).

Fig. 18. Effectiveness: detecting precision (a) and recall (b) of intruder appearances on different datasets.

Fig. 19. Effectiveness: detecting precision (a) and recall (b) with respect to the trajectory length threshold.

Fig. 20. Effectiveness: tracking precision (a) and recall (b) with respect to the trajectory length threshold.



Fig. 21. Effectiveness: detecting precision (a) and recall (b) with respect to $\omega$.

Fig. 22. Effectiveness: tracking precision (a) and recall (b) with respect to $\omega$.

Fig. 23. Effectiveness: detecting precision (a) and recall (b) with respect to $\delta$.

Fig. 24. Effectiveness: tracking precision (a) and recall (b) with respect to $\delta$.

The decay factor $\delta$ is used to filter the candidate trajectories; if it is set too large, the algorithm may prune some trustworthy candidates. The recall of LM is then reduced. Note that the recall of LM drops rapidly on D4; nearly half of the intruder trajectories cannot be discovered if $\delta$ is set to 0.2. Because the false-positive and false-negative rates are much higher in D4 than in other datasets, the trustworthiness scores of valid detections are not high. According to Equation (14), if the value of $\delta$ is set too high, some valid trajectories will also be pruned. As a result, the value of $\delta$ should be set relatively small (e.g., 0.05) to discover trajectories in a noisy dataset.

Fig. 25. Experiment settings of mining lines on the roads.

6.4. Experiments of Mining Lines on the Roads

We evaluate the performance of three algorithms: (1) LM, the original LiSM algorithm; (2) FR, the improved algorithm with a filtering-and-refinement framework, but still using the cone model for tracking; and (3) LR, the LoRM algorithm, with a filtering-and-refinement framework and tracking based on the shortest-path-measure. The details of the experiment setting are listed in Figure 25.

First, we evaluate the efficiency of the three algorithms. The time costs on the four datasets are recorded in Figure 26(a). Note that the y-axis is in logarithmic scale. The time costs of FR and LR are only 15% to 20% of LM's. Figure 26(b) shows the percentage of road network distance computation time over the total cost. Without a filtering-and-refinement framework, distance computation becomes the bottleneck of LM. However, FR and LR filter most intruder appearances without distance computation. A huge amount of time is then saved.





Fig. 26. Efficiency: time costs on different datasets (a) and the portion of distance computation (b).

Fig. 27. Effectiveness: tracking precision (a) and recall (b) on different datasets.

In the next experiment, we check the tracking precision and recall of the three algorithms. The results are shown in Figure 27. Although FR's efficiency is close to LR's, its tracking precision and recall are much lower. LR has the highest tracking precision and recall on all of the datasets. Dataset D4 contains 50% false positives and 30% false negatives. LR still achieves about 90% precision and nearly 80% recall on it. These results indicate that LR is more suitable than LM and FR to track intruders on the road network.

7. DISCUSSION

In this section we discuss some issues of mining lines in the sand.

(1) Why use the cone model for trajectory tracking?

In this study, we propose the cone model for trajectory tracking. Indeed, there are several state-of-the-art methods to model trajectory uncertainty and conduct trajectory predictions, such as the cylinder model [Trajcevski et al. 2004] and bead model [Pfoser and Jensen 1999] (more details of these works are introduced in Section 8.3). Most of them are designed for the purposes of completing low-sampling-rate GPS trajectories or range queries. In these models, the system computes possible regions of a moving object between two consecutive trajectory points. However, in the scenario of mining lines in the sand, the system needs to predict the future intruder appearance based on historical data. We cannot use these uncertainty models directly. In addition, the efficiency of model computation is a major issue. LiSM uses a greedy algorithm for trajectory tracking, where each trajectory captures the best-matched appearance. Hence, the system can output the results online.





Fig. 28. Example: the different models for trajectory tracking.

Fig. 29. Effectiveness: tracking precision (a) and recall (b) on different models.

There are some prediction models similar to the cone model, such as the pie model (Figure 28(b)) and disc model (Figure 28(c)). The disc model is computed with a fixed upper bound for moving speed, and the pie model is refined from the disc model by considering the moving direction. We conduct an experiment to test the effectiveness of these models. The experiment settings are the same as shown in Figure 16. We use Algorithm 1 to detect intruder appearances and track the trajectories with different models. The tracking precision and recall are shown in Figure 29.

The results show that although the disc model and pie model have similar recall to the cone model, their tracking precision is much lower. The reason is the missing lower bound for moving speed. Neither the pie model nor the disc model places any restriction on the minimum speed of the moving objects. The false-positive sensors continuously produce many false detection records, and the positions of the false detections are close to each other. Without a lower bound for moving speed, the system is likely to track those neighboring false detections together and report many ghost trajectories.

(2) In which cases is the cone model not effective?

We propose the cone model to predict an intruder's next appearance. This model is designed based on an observation from the real data: when intruders travel through monitoring regions, they move with a relatively stable speed and direction. Sudden changes of speed or direction are rare. In addition, most intruders pass through the monitoring region as soon as possible; they do not stop or stay for a long time. Therefore, the cone model assumes that an intruder's moving speed and direction follow the normal distribution.

Fig. 30. Example: the case of intruders meeting.

If the intruders have some specific movements, such as patrolling around the region, stopping at some positions, or frequently changing speed and direction, the cone model is not effective at tracking them. The cone model is also not effective on the road network, so we propose the shortest-path-measure and redesign the tracking algorithm for the road. In the real world, vehicle movements on the road are influenced by many factors, including traffic, weather, and road conditions. Since we focus only on studying the sensor data, our road network model does not consider these factors. The infrastructure of the proposed methods can be easily adapted to more complex models to achieve better accuracy.

The cone model is also limited in the case of meeting, as shown in Figure 30. $o_i$ and $o_j$ are two intruders moving through the monitoring region; they meet and make a turn at time $t_4$. The detected intruder appearances, $p_{i,4}$ and $p_{j,4}$, are close to each other. In such a case, the cone model wrongly tracks their movements, and the mined trajectories are misleading. The problem is caused by considering only the spatial information: $o_i$'s next appearance is the best match of $o_j$'s historical movement, and the cone model cannot distinguish two intruders solely based on their recent trajectories. To solve the problem, the system needs extra information to identify multiple intruders, such as the signal strength emitted from the intruder (e.g., magnetic force or sound), the speed of the intruder (not estimated from -recent trajectory but measured by the sensors), and so on. In our previous work [Tang et al. 2012a], we studied the problem of identifying the types of intruders based on the emitted signal strength.

(3) Are there any problems with using discovery results to measure sensor reliability?

In the framework of LiSM, the system feeds back the results of discovered trajectories to adjust sensor robustness and sensitivity scores. This mechanism has a potential risk of mispunishment: (1) the ghost trajectories may contain some valid intruder appearances, and the system may wrongly reduce the robustness of the sensors that make correct detections; (2) it is possible that some false-positive detections make up a ghost trajectory that is longer than the length threshold, in which case the system mistakenly reduces the sensitivity of all nonresponding sensors. These cases are indeed rare, as most ghost trajectories do not contain any valid appearance, and few of them are longer than the length threshold.

To reduce the risk of mispunishment, we can set a lower bound for $n_i$ (the total number of detection records by sensor $s_i$) in Equations (15) and (16). When updating the robustness and sensitivity scores, if $n_i$ is smaller than the lower bound, the system uses the value of the lower bound to replace $n_i$. In this way, the influence of the preceding cases is reduced. On the other hand, if a sensor becomes unreliable and begins to generate false positives or negatives, it is very likely to continue generating more false positives and negatives. In this way, the number of false reports from this sensor will increase rapidly, and the system can detect it soon and reduce the robustness and sensitivity.

8. RELATED WORK

8.1. Trustworthiness Analysis in CPS

Sha et al. [2008] review the history of CPS development. The study of data trustworthiness is listed as one of the three major challenges. The CPS applications require a high level of trust in the operations. The trustworthiness is measured from reliability, safety, security, and usability. The system models and abstractions should incorporate fault models and recovery policies that reflect the scale, lifetime, control, and reparability of components.

Ganti et al. [2008] propose a CPS application of SenseWorld. It is used to facilitate connecting sensors, people, and software objects to build community-centric sensing applications. The authors point out that the first and foremost challenge is to infer higher-level information from the lower-level sensor data. To obtain meaningful information, they use a frequent itemset mining algorithm to collect the high-level inferences and get a general picture.

Johnson and Mitra [2009] provide several directions to handle the failures in CPS. They classify the failures into three categories: (1) the cyber-side failures, such as software bugs and system crashes; (2) the physical-side problems, such as sensor failures and irrelevant object influences; and (3) the communication-side issues, including message drops, omissions, man-in-the-middle attacks, and so on. The authors suggest that the degradation of the system state could potentially be used to detect failures.

Makedon et al. [2009] design an event-driven framework for assistive CPS environments. The event is modeled as the abnormal behavior of the system, such as accidents or acute needs. Those behaviors are organized in a hierarchical tree. Two-step event identification first assimilates different types of data and then identifies the event of interest. A low-level security standard is applied to the raw data, and a high-level security strategy is used to check the generalized events.

The research of CPS is still at the beginning stage. Many papers have addressed the importance of data trustworthiness, but detailed solution plans are seldom provided. As mentioned in Johnson and Mitra [2009], the complexity of this problem is the main challenge. Some studies use the strategies of abstraction and generalization to reduce the influences of false alarms [Ganti et al. 2008] or to detect untrustworthy data by performance degradation of the system [Makedon et al. 2009]. However, most works only propose such strategies as research directions and do not have detailed solutions.

8.2. Intruder Detection from Sensor Data

Arora et al. [2004] propose the intrusion detection problem in wireless networks and design a detection model with a sensor network. The approach is based on a dense, distributed wireless network. The authors study nine types of sensors, including magnetic, radar, thermal, acoustic sensors, and so on. Based on the performance requirements of the scenario and the sensing, communication, energy, and computation ability, the magnetic and radar sensors are used to detect intruders. The authors propose a classifier to determine the intruder type based on the number of detection records. For example, if a vehicle passes through the monitoring field, about 40 sensors detect it and send out records; if a soldier passes through the monitoring field, only 20 detection records are generated. Since the classification model is constructed based on the number of detection records, the accuracy may be influenced by the false positives.

Tang et al. propose the TruAlarm filtering method [Tang et al. 2010, 2012b]. TruAlarm carries out trustworthiness inferences for the sensor alarms based on the estimated locations of the objects.

Sheng and Hu [2005] propose the maximum likelihood-based estimation method. This method uses acoustic signal energy measurements taken at individual sensors of an ad hoc wireless sensor network to estimate the locations of multiple acoustic sources. They propose a multiresolution search algorithm and an EM-like iterative algorithm to expedite the computation of source locations. Lin et al. [2006] propose a framework for the in-network intruder tracking. They develop the DAT and ZAT tree structures according to the physical topology of the sensor network. The proposed method can analytically formulate the cost of object tracking based on the update and query rates, which is a significant improvement in this area. Tang et al. [2012a] propose an IntruMine method to detect and verify the intruders in a sensor network. IntruMine constructs several monitoring graphs to model the relationships between sensors and possible intruders, and computes the position and energy of each intruder with the link information from these monitoring graphs.

Cevher and Kaplan [2007] study the problem of how to assign sensors to different tracking and monitoring tasks to achieve the optimal efficiency. They propose a sensor assignment algorithm with fuzzy location estimation.

Most methods try to detect the intruders in a single snapshot, that is, without considering the temporal information of the intruders' movement. Some studies focus on saving sensor energy and communication bandwidth; they try to provide an optimal sensor deployment plan. LiSM and LoRM actually complement those technologies and improve the system's applicability.

8.3. Trajectory Prediction and Map Matching

Wolfson points out in a vision paper [Wolfson 2002] that the trajectory of a moving object is inherently imprecise due to continuous motion. Trajcevski et al. [2004] propose the cylinder model for the uncertainty of trajectories. In this model, the possible region of a moving object at a timestamp is within a disc, and the trajectory is represented as a cylinder in the 3D (2D+time) space. A comparable model, the space-time prism (beads) [Pfoser and Jensen 1999], is proposed to represent the uncertain movement of a moving object as the union of two half-cones (a bead) in the 3D space. Kuijpers and Othman [2010] propose a bead model-based approach to model the uncertainty in trajectories. This method has significantly improved efficiency in managing the uncertainty of trajectories.
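The two uncertainty models can be contrasted with a small sketch. It assumes the usual geometric reading of each: in the cylinder model, the object lies within a fixed radius of its linearly interpolated position; in the space-time prism, a point is feasible if it is reachable from the earlier sample and can still reach the later sample under a maximum speed. The function names, parameters, and sample values are illustrative assumptions, not code from the cited works.

```python
import math

def in_cylinder(a, b, query, radius):
    """Cylinder model: query (t, x, y) lies within `radius` of the position
    linearly interpolated between samples a and b (both (t, x, y), a earlier)."""
    (ta, xa, ya), (tb, xb, yb), (t, x, y) = a, b, query
    if not ta <= t <= tb:
        return False
    frac = (t - ta) / (tb - ta)           # assumes tb > ta
    ex, ey = xa + frac * (xb - xa), ya + frac * (yb - ya)
    return math.hypot(x - ex, y - ey) <= radius

def in_bead(a, b, query, v_max):
    """Space-time prism: query is reachable from a and can still reach b
    without exceeding the maximum speed v_max."""
    (ta, xa, ya), (tb, xb, yb), (t, x, y) = a, b, query
    if not ta <= t <= tb:
        return False
    reachable_from_a = math.hypot(x - xa, y - ya) <= v_max * (t - ta)
    can_reach_b = math.hypot(xb - x, yb - y) <= v_max * (tb - t)
    return reachable_from_a and can_reach_b

# Hypothetical samples: the object is at (0, 0) at t=0 and at (10, 0) at t=10.
a, b = (0.0, 0.0, 0.0), (10.0, 10.0, 0.0)
mid_detour = (5.0, 5.0, 3.0)    # halfway in time, 3 units off the straight line
early_jump = (1.0, 5.0, 0.0)    # 5 units away after only 1 time unit
```

With a maximum speed of 2, the detour point is inside the prism but outside a cylinder of radius 1, while the early jump is infeasible for the prism because it would require a speed of 5.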

The problem of map matching (i.e., matching the trajectories to the road network) has been studied by many researchers in past decades. Greenfeld [2002] uses distance and orientation similarity measures to match the trajectory points to a road edge. The adaptive clipping method uses the Dijkstra algorithm to construct the shortest path on a local free space graph. Yin and Wolfson [2004] employ an offline snapping method that aims to find a minimum weight path based on edit distance. Brakatsoulas et al. [2005] propose the map-matching approaches based on the Fréchet distance and its variants. Fréchet distance takes the continuity of curves into account and is therefore suitable for comparing trajectories. Pink and Hummel [2008] propose a map-matching method based on a Bayesian classifier that incorporates a hidden Markov model to model topological constraints of the road network. Hummel and Tischler [2005] extend the Kalman filter and cubic spline interpolation to enhance map matching.
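To make the Fréchet-based comparison concrete, the sketch below computes the discrete Fréchet distance between two sampled curves with dynamic programming. This is a commonly used simplification that works on sample points only; it is not the continuous Fréchet distance or the specific variants used by Brakatsoulas et al., and the function name and sample curves are assumptions for illustration.

```python
import math

def discrete_frechet(p, q):
    """Discrete Frechet distance between two point sequences p and q.

    ca[i][j] holds the smallest achievable "leash length" for walking
    p[0..i] and q[0..j] in order, where each step advances one or both walkers.
    """
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])  # Euclidean distance (Python 3.8+)
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]

# Hypothetical example: a trajectory and a parallel road one unit away.
trajectory = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
road = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
same_direction = discrete_frechet(trajectory, road)
opposite_direction = discrete_frechet(trajectory, list(reversed(road)))
```

Because the points must be traversed in order, matching the trajectory against the road in the opposite direction gives a larger distance than matching in the same direction, which is the ordering sensitivity that makes the measure useful for comparing trajectories.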





Zheng et al. [2011] introduce the uncertain trajectories hierarchy for probabilistic range queries. In this milestone work, the uncertainty of the objects moving along road networks is modeled as time-dependent probability distribution functions by giving a maximal speed on each road segment. In this way, both snapshot and continuous probabilistic range queries can be conducted on the road network. Kuijpers et al. [2010] extend the models of space-time prisms to the road network. This model represents the measurement errors of a moving object. An algorithm is proposed to compute the envelope of all space-time prisms that have an anchor in these sample regions. The spatial projection of the network-based space-time prisms under uncertainty is thus derived.

However, most of the state-of-the-art models are designed for the purposes of trajectory completion and range query. The uncertainty models are frequently used to address the low sampling problem in GPS trajectories. In the scenario of mining lines in the sand, the sampling frequency of the sensors is relatively high; the main problem is to predict the next position based on existing data. Hence, these models cannot be used directly.

8.4. Target Tracking in Sensor Network

Ozdemir et al. [2009] use the techniques of particle filtering to track intruders. They model the link between sensors and the fusion center as a binary symmetric channel and create a general framework using particle filters. Three different types of channel-aware particle filters have been developed, including hard-decoding links, coherent soft-decoding links, and noncoherent soft-decoding links.

Hammad et al. [2003] propose the stream window join algorithm to track moving objects in sensor network databases. Aslam et al. [2003] propose a geometry-based method to track intruders. Lin et al. [2006] propose a framework for in-network intruder tracking. Zhong et al. [2009] provide the techniques to track targets with the sequence of the alarming sensors. Trajcevski et al. [2010] propose the methods of trajectory reduction for energy savings. Ghica et al. [2010] propose the tracking principles with epoch awareness. Liu et al. [2004] study the distributed state representation problem for target tracking in sensor networks. Oh et al. [2009] propose the Markov chain data association method for target tracking. Crespi et al. [2008] propose a quantitative theory of trackability to investigate the consistent tracking number in a sensor network application.

In these studies, the researchers assume that the intruders' locations at each snapshot are already known. They focus on combining the intruder detections at different snapshots to generate trajectories. However, as pointed out in Arora et al. [2004], the intruder tracking results cannot be accurate when they are based on many false intruder detections. In addition, the tracking model is a problem in theory because it is difficult to build an accurate kinematic movement model [Crespi et al. 2008]. The model must be adjusted dynamically during the tracking process. LiSM and LoRM utilize the -recent trajectory to rebuild the tracking model at every timestamp and thus keep high tracking accuracy.

Pan et al. [2007a] use the supervised learning method to locate receiving signal sensors (RSS). The transfer learning techniques are proposed as semisupervised mining [Pan et al. 2007b]. The main difference between their works and ours lies in the sensors. The RSS are moving sensors that receive signals from some fixed nodes (APs), and the algorithm is designed to learn sensor locations. In the CPS applications, most sensors are fixed and their locations are already known. The problem is to detect the location of intruders.

To the best of our knowledge, this is the first study to solve both detecting and tracking problems in an integrated framework. Figure 31 compares the features of some related approaches with the proposed method.





Fig. 31. The comparison to related works.

9. CONCLUSION AND FUTURE WORK

This study investigates the problem of mining trajectories in a CPS. We propose a novel method, LiSM, to discover intruder trajectories from untrustworthy sensor data. The watching network is designed to detect intruder appearances, and the cone model is used to track their trajectories. In addition, LoRM is proposed to discover trajectories on the road network. We evaluate the proposed algorithms in extensive experiments on big datasets. The proposed methods achieve better performance in both mining efficiency and accuracy.

In the future, we are going to integrate LiSM and LoRM with more information (e.g., weather and traffic) to improve system performance.

REFERENCES

Anish Arora, Prabal Dutta, Sandip Bapat, Vinod Kulathumani, Hongwei Zhang, Vinayak Naik, Vineet Mittal, Hui Cao, Murat Demirbas, Mohamed G. Gouda, Young-ri Choi, Ted Herman, Sandeep S. Kulkarni, Umamaheswaran Arumugam, Mikhail Nesterenko, Adnan Vora, and Mark Miyashita. 2004. A line in the sand: A wireless sensor network for target detection, cl
