arXiv:1509.06659v3 [cs.SI] 18 Jun 2017
An Entity Resolution Approach to Isolate Instances of Human Trafficking Online

Chirag Nagpal, Kyle Miller, Benedikt Boecking and Artur Dubrawski
[email protected], [email protected], [email protected], [email protected]

Carnegie Mellon University

Abstract

Human trafficking is a challenging law enforcement problem, and a large amount of such activity manifests itself on various online forums. Given the large, heterogeneous and noisy structure of this data, building models to predict instances of trafficking is an even more convoluted task. In this paper we propose an entity resolution pipeline using a notion of proxy labels, in order to extract clusters from this data with prior history of human trafficking activity. We apply this pipeline to 5M records from backpage.com and report on the performance of this approach, challenges in terms of scalability, and some significant domain-specific characteristics of our resolved entities.

1 Introduction

Over the years human trafficking has grown to be a challenging law enforcement problem. The advent of the internet has brought the problem into the public domain, making it an ever greater societal concern. Prior studies (Kennedy, 2012) have leveraged computational techniques on this data to detect spatio-temporal patterns by utilizing certain features of the ads. Other studies (Dubrawski et al., 2015) have utilized machine learning approaches to identify whether ads could possibly be involved in human trafficking activity. Significant work has also been carried out in building large distributed systems to store and process such data, and to carry out entity resolution to establish ontological relationships between various entities (Szekely et al., 2015).

In this paper we explore the possibility of leveraging this information to identify sources of these advertisements, isolate such clusters, and identify potential sources of human trafficking from this data using prior domain knowledge.

In the case of ordinary Entity Resolution schemes, each record is considered to represent a single entity. A popular approach in such scenarios is a 'merge and purge' strategy: as records are compared and matched, they are merged into a single, more informative record, and the individual records are deleted from the dataset (Benjelloun et al., 2009).

While our problem can be considered a case of Entity Resolution, escort advertisements are a challenging, noisy and unstructured dataset. A single escort advertisement may represent one entity or a group of entities, and hence may contain features belonging to more than one individual or group.

The advertisements are also associated with multiple features, including text, hyperlinks, images, timestamps, locations, etc. In order to featurize characteristics from text we use a regex-based information extractor built on the GATE framework (Cunningham, 2002). This allows us to generate certain domain-specific features from our dataset, including the aliases, cost, location, phone numbers, specific URLs, etc. of the entities advertised. We use these features, along with other generic text, the images, etc. as features for our classifier. The high reuse of similar features makes it difficult to use an exact match over a single feature in order to perform entity resolution.
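As an illustration, a minimal sketch of this kind of regex-based extraction is given below; the patterns are simplified assumptions for exposition, far less elaborate than the JAPE grammars our extractor actually uses.

```python
import re

# Illustrative patterns only; the production extractor relies on much
# richer JAPE grammars (GATE) to handle intentional obfuscation.
PATTERNS = {
    "phone": re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"),
    "cost": re.compile(r"\$\s?\d{2,4}(?:\s?/?\s?(?:hr|hh|hour))?", re.I),
    "url": re.compile(r"(?:https?://|www\.)\S+", re.I),
}

def extract_features(ad_text):
    """Map each feature name to the list of raw matches found in the ad."""
    return {name: pat.findall(ad_text) for name, pat in PATTERNS.items()}

# e.g. extract_features("Sweet Paris $250/hr call (555) 123-4567")
```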

We proceed to leverage machine learning approaches to learn a function that can predict whether two advertisements are from the same source. The challenge is that we have no prior knowledge of the source of the advertisements. We thus depend upon a strong feature, in our case phone numbers, which can be used as proxy evidence for the source of an advertisement and can help us generate labels for the training and test data for a classifier. We can therefore use such strong evidence to learn another function which can help us generate labels for our dataset; this semi-supervised approach is described as 'surrogate learning' in (Veeramachaneni and Kondadadi, 2009). Pairwise comparison results in an extremely high number of comparisons over the entire dataset. In order to reduce this, we use a blocking scheme based on certain features.

Figure 1: Escort advertisements are a classic source of what can be described as noisy text. (a) Search results on backpage.com. (b) Representative escort advertisement. Notice the excessive use of emojis, intentional misspellings and relatively benign colloquialisms to obfuscate a more nefarious intent. Domain experts extract meaningful cues from the spatial and temporal indicators, and other linguistic markers, to suspect trafficking activity, which further motivates the use of computational approaches to support such decision making.

The resulting clusters are isolated for human trafficking using prior expert knowledge and featurized. Rule learning is used to establish differences between these and other components. The entire pipeline is represented in Figure 2.

2 Domain and Feature Extraction

Figure 1 is illustrative of the search results for escort advertisements and a page advertising a particular individual. The text is inundated with special characters and emojis, as well as misspelled words that are specific markers and highly informative to domain experts. The text consists of information regarding the escort's area of operation, phone number, any particular client preferences, and the advertised cost. We proceed to build regular expression based feature extractors to extract this information and store it in a fixed schema, using the popular JAPE tool, part of the GATE suite of NLP tools. The extractor we built for this domain, AnonymousExtractor, is open source and publicly available at github.com/mille856/CMU_memex.

Table 1 lists the performance of our extraction tool on 1,000 randomly sampled escort advertisements, for the various features. Most of the features are self-explanatory. (The reader is directed to (Dubrawski et al., 2015) for a complete description of the fields extracted.) The noisy nature of the data, along with intentional obfuscation, especially in the case of features like Names, results in lower performance compared to the other extracted features.

Table 1: Performance of TJBatchExtractor

Feature         Precision  Recall  F1 Score
Age             0.980      0.731   0.838
Cost            0.889      0.966   0.926
E-mail          1.000      1.000   1.000
Ethnicity       0.969      0.876   0.920
Eye Color       1.000      0.962   0.981
Hair Color      0.981      0.959   0.970
Name            0.896      0.801   0.846
Phone Number    0.998      0.995   0.997
Restriction(s)  0.949      0.812   0.875
Skin Color      0.971      0.971   0.971
URL             0.854      0.872   0.863
Height          0.978      0.962   0.970
Measurement     0.919      0.883   0.901
Weight          0.976      0.912   0.943

Apart from the regular expression based features, we also extract the hashcodes of the images in the advertisements, the posting date and time, and the location.[1]

[1] These features are present as metadata, and do not require the use of hand-engineered regexes.

Figure 2: The proposed Entity Resolution pipeline: feature extraction from raw data; entity resolution with strong features; sampling data and training a match function; entity resolution with the learnt match function; and rule learning.

3 Entity Resolution

3.1 Definition

We approach the problem of extracting connected components from our dataset using pairwise entity resolution. The similarity or connection between two nodes is treated as a learning problem, with training data for the problem generated by using 'proxy' labels from existing evidence of connectivity from strong features.

More formally, the problem can be considered as sampling all connected components $H_i(V, E)$ from a graph $G(V, E)$. Here $V$, the set of vertices $\{v_1, v_2, \ldots, v_n\}$, is the set of advertisements, and $E = \{(v_i, v_j), (v_j, v_k), \ldots, (v_k, v_l)\}$ is the set of edges between individual records, the presence of which indicates that the records represent the same entity.

We need to learn a function $M(v_i, v_j)$ such that

$$M(v_i, v_j) = \Pr\big((v_i, v_j) \in E(H_i), \forall H_i \in H\big)$$

The set of strong features present in a given record can be considered to be the function $S$. Thus, in our problem, $S_v$ represents all the phone numbers associated with $v$, and $S = \bigcup S_{v_i}, \forall v_i \in V$. Here $|S| \ll |V|$.

Now, let us further consider the graph $G^*(V, E)$ defined on the set of vertices $V$, such that $(v_i, v_j) \in E(G^*)$ if $|S_{v_i} \cap S_{v_j}| > 0$ (more simply, the graph described by the strong features).

Let $H^*$ be the set of all connected components $\{H^*_1(V, E), H^*_2(V, E), \ldots, H^*_n(V, E)\}$ defined on the graph $G^*(V, E)$.

Now, the function $P$ is such that for any $p_i \in S$,

$$P(p_i) = V(H^*_k) \iff p_i \in \bigcup S_{v_i}, \forall v_i \in V(H^*_k)$$
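To make the construction of $G^*$ and its components $H^*$ concrete, here is a minimal sketch using networkx; the record layout (a dict with a "phones" set per ad) is an assumption for illustration.

```python
import itertools
import networkx as nx

def strong_feature_components(ads):
    """Build G*: one node per ad, an edge whenever two ads share a
    phone number (|S_vi ∩ S_vj| > 0); return the components H*."""
    g = nx.Graph()
    g.add_nodes_from(range(len(ads)))
    by_phone = {}                        # phone number -> ad indices
    for i, ad in enumerate(ads):
        for phone in ad["phones"]:
            by_phone.setdefault(phone, []).append(i)
    for ids in by_phone.values():        # link all ads sharing a phone
        for i, j in itertools.combinations(ids, 2):
            g.add_edge(i, j)
    return list(nx.connected_components(g))
```

Indexing by phone number avoids an all-pairs scan when building the strong-feature graph.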

3.2 Sampling Scheme

For our classifier we need to generate a set of training examples $T$, where $T_{pos}$ and $T_{neg}$ are the subsets of samples labeled positive and negative:

$$T_{pos} = \{F_{v_i,v_j} \mid v_i \in P(p_i), v_j \in P(p_i), \forall p_i \in S\}$$
$$T_{neg} = \{F_{v_i,v_j} \mid v_i \in P(p_i), v_j \notin P(p_i), \forall p_i \in S\}$$

In order to ensure that the sampling scheme does not end up sampling near-duplicate pairs, we introduce a sampling bias such that for every feature vector $F_{v_i,v_j} \in T_{pos}$, $S_{v_i} \cap S_{v_j} = \emptyset$.
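A sketch of this biased sampling, reusing the component structure $H^*$ from the previous sketch; `featurize` is a hypothetical pairwise feature builder, and the pair budgets are illustrative.

```python
import random

def sample_training_pairs(components, ads, featurize, k=10000, tries=100000):
    """T_pos: pairs inside one strong-feature component whose own phone
    sets are disjoint (linked only transitively); T_neg: pairs drawn
    across two different components."""
    comp_list = [list(c) for c in components if len(c) > 2]
    pos, neg = [], []
    for _ in range(tries):
        if len(pos) >= k:
            break
        i, j = random.sample(random.choice(comp_list), 2)
        if ads[i]["phones"] & ads[j]["phones"]:
            continue                 # sampling bias: skip likely near-duplicates
        pos.append((featurize(ads[i], ads[j]), 1))
    while len(neg) < len(pos):
        c1, c2 = random.sample(comp_list, 2)
        i, j = random.choice(c1), random.choice(c2)
        neg.append((featurize(ads[i], ads[j]), 0))
    return pos + neg
```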

Figure 3: On applying our match function, weak links are generated for classifier scores above a certain match threshold. Strong links between nodes are represented by solid lines; dashed lines represent the weak links generated by our classifier.

This reduces the likelihood of sampling near-duplicates, as evidenced in Figure 4, which is a histogram of the Jaccard similarity between the sets of unigrams of the text contained in each pair of ads:

$$sim(v_i, v_j) = \frac{|unigrams(v_i) \cap unigrams(v_j)|}{|unigrams(v_i) \cup unigrams(v_j)|}$$

We observe that although we do still end up with some near duplicates ($sim > 0.9$), we have a high number of non-duplicates ($0.1 < sim < 0.3$), which ensures robust training data for our classifier.
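For reference, this similarity reduces to a few lines of Python; a simple whitespace tokenizer is assumed here in place of the actual tokenization.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity over the unigram sets of two ad texts."""
    a, b = set(text_a.split()), set(text_b.split())
    return len(a & b) / len(a | b) if (a or b) else 0.0
```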

Figure 4: Text similarity for our sampling scheme (number of pairs versus Jaccard similarity of ad unigrams). The histogram shows that the sampling scheme results in both a large number of near duplicates and of non-duplicates. Such behavior is desired to ensure a robust match function.

Figure 5: ROC curves (true positives versus false positives, on linear and logarithmic axes) for our match function trained on various feature sets: Regex; Regex+Temporal; Regex+Temporal+NLP; Regex+Temporal+NLP+Spatial. Each panel compares Random Forest (RF), Logistic Regression (LR) and Naive Bayes (NB) against random guessing (Rnd). The ROC curves show reasonably large true positive rates at extremely low false positive rates, which is a desirable behaviour of the match function.

3.3 Training

To train our classifier we experiment with various classifiers like Logistic Regression, Naive Bayes and Random Forest, using Scikit-learn (Pedregosa et al., 2011). Table 2 shows the most informative features learnt by the Random Forest classifier. It is interesting to note that the most informative features include spatial (Location), temporal (Time Difference, Posting Date) and linguistic (Number of Special Characters, Longest Common Substring) features. We also find that the domain-specific features, extracted using regexes, prove to be informative.

Table 2: Most Informative Features

Top 10 Features
1   Location (State)
2   Number of Special Characters
3   Longest Common Substring
4   Number of Unique Tokens
5   Time Difference
6   If Posted on Same Day
7   Presence of Ethnicity
8   Presence of Rate
9   Presence of Restrictions
10  Presence of Names
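As a sketch of this training setup, assuming the pairwise feature vectors and proxy labels from the sampling scheme are already assembled into arrays X and y, a minimal Scikit-learn loop might look as follows:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def train_match_functions(X, y):
    """Train the three candidate match functions and report ROC AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              random_state=0)
    models = {
        "RF": RandomForestClassifier(n_estimators=100, random_state=0),
        "LR": LogisticRegression(max_iter=1000),
        "NB": GaussianNB(),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        scores = model.predict_proba(X_te)[:, 1]  # P(same source)
        print(name, "AUC:", roc_auc_score(y_te, scores))
    return models
```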

The ROC curves for the classifiers we tested with different feature sets are presented in Figure 5. The classifiers perform well, with extremely low false positive rates. Such behavior is desirable for the classifier to act as a match function, in order to generate sensible results for the downstream tasks. High false positive rates increase the number of links between our records, leading to a 'snowball effect' which results in a breakdown of the downstream Entity Resolution process, as evidenced in Figure 6.

Figure 6: Number of connected components and size of the largest connected component versus the match threshold, for Logistic Regression and Random Forest.

In order to minimize this breakdown, we need to heuristically learn an appropriate confidence threshold for our classifier. This is done by carrying out the ER process on 10,000 randomly selected records from our dataset. The size of the largest extracted connected component and the number of connected components isolated are calculated for different confidence values of our classifier. This allows us to come up with a sensible heuristic for the confidence value.
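A minimal sketch of this threshold sweep, where `scored_pairs` is assumed to hold (i, j, score) triples produced by the match function on the 10,000-record sample:

```python
import networkx as nx

def sweep_match_threshold(scored_pairs, n_records, thresholds):
    """For each candidate threshold, keep the links whose classifier
    score clears it, then record the number of connected components
    and the size of the largest one (the quantities in Figure 6)."""
    stats = []
    for t in thresholds:
        g = nx.Graph()
        g.add_nodes_from(range(n_records))
        g.add_edges_from((i, j) for i, j, s in scored_pairs if s >= t)
        comps = list(nx.connected_components(g))
        stats.append((t, len(comps), max(len(c) for c in comps)))
    return stats
```

A sharp growth of the largest component as the threshold drops signals the onset of the snowball effect.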

3.4 Blocking Scheme

Our dataset consists of over 5 million records. Naive pairwise comparison across the dataset makes this problem computationally intractable. In order to reduce the number of comparisons, we introduce a blocking scheme and perform exhaustive pairwise comparisons only within each block, before resolving the dataset across blocks. We block the dataset on features like rare unigrams, rare bigrams and rare images.

Figure 7: Blocking scheme, with blocks keyed on rare bigrams, unigrams and images.

Table 3: Results of Rule Learning

Rule                                                                  Support  Ratio    Lift
Xminchars<=250, 120000<Xmaximgfrq, 3<Xmnweeks<=3.4, 4<Xmnmonths<=6.5  11       90.9%    2.67
Xminchars<=250, 120000<Xmaximgfrq, 4<Xmnmonths<=6.5                   16       81.25%   2.4
Xstatesnorm<=0.03, 3.6<Xuniqimgsnorm<=5.2, 3.2<Xstdmonths             17       100.0%   2.5
Xstatesnorm<=0.03, 1.95<Xstdweeks<=2.2, 3.2<Xstdmonths                19       94.74%   2.37
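Returning to the blocking scheme, a minimal sketch keyed on rare unigrams follows; the rarity cutoff and record layout are illustrative assumptions, and analogous blocks can be keyed on rare bigrams and image hashes.

```python
from collections import Counter, defaultdict

def block_by_rare_unigrams(ads, max_doc_freq=50):
    """Group ad indices into blocks keyed by rare unigrams, so exhaustive
    pairwise comparison only runs within each (small) block."""
    df = Counter()
    for ad in ads:
        df.update(set(ad["text"].split()))
    blocks = defaultdict(list)
    for i, ad in enumerate(ads):
        for tok in set(ad["text"].split()):
            if df[tok] <= max_doc_freq:   # token rare enough to key a block
                blocks[tok].append(i)
    return {k: v for k, v in blocks.items() if len(v) > 1}
```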

Figure 9: ROC for the Connected Component classifier. The black line is the positive set, while the red line is the average ROC for 100 randomly guessed predictors.

Figure 10: PN curve (true positives versus false positives) for rule learning. The figure presents PN curves for various values of the maximum number of rules learnt for the classification (Max Rules: 2, 3, 5).

4 Rule Learning

We extract clusters and identify records that are associated with human trafficking using domain knowledge from experts. We featurize the extracted components using features like the size of the cluster, the spatio-temporal characteristics, and the connectivity of the clusters. For our analysis, we consider only components with more than 300 advertisements. We then train a random forest to predict whether a cluster is linked to human trafficking. In order to establish statistical significance, we compare the ROC results of our classifier under 4-fold cross validation for 100 random connected components versus the positive set. Figure 9 and Table 4 list the performance of the classifier in terms of false positive and true positive rates, while Table 5 lists the most informative features for this classifier.
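A sketch of this component-level classification step; the three features here are simplified stand-ins for the full cluster feature set, and the record layout is assumed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def component_features(component, ads):
    """Simplified cluster-level features: size, spatial and temporal spread."""
    nodes = list(component)
    return [
        len(nodes),                                           # cluster size
        len({ads[i]["location"] for i in nodes}),             # spatial spread
        float(np.std([ads[i]["timestamp"] for i in nodes])),  # temporal spread
    ]

def classify_components(components, ads, labels):
    X = np.array([component_features(c, ads) for c in components])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, labels, cv=4, scoring="roc_auc")
    print("4-fold ROC AUC: %.3f" % scores.mean())
    return clf.fit(X, labels)
```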

We then proceed to learn rules from our feature set. Some of the rules, with corresponding ratios and lift, are given in Table 3. PN curves corresponding to the various learnt rules are presented in Figure 10. It can be observed that the features used by rule learning to learn the rules with maximum support and ratios correspond to the ones labeled informative by the random forest. This also serves as validation for the use of rule learning.
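One plausible reading of the support, ratio and lift columns of Table 3, expressed as a small helper; the rule predicate and feature names are illustrative assumptions.

```python
def rule_metrics(rule, X, y):
    """Support: components matched by the rule; Ratio: fraction of the
    matches that are positive; Lift: ratio relative to the base rate."""
    matched = [yi for xi, yi in zip(X, y) if rule(xi)]
    support = len(matched)
    ratio = sum(matched) / support if support else 0.0
    base_rate = sum(y) / len(y)
    return support, ratio, (ratio / base_rate if base_rate else 0.0)

# e.g. rule = lambda x: x["minchars"] <= 250 and x["maximgfrq"] > 120000
```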

Table 4: Metrics for the Connected Component classifier

AUC     TPR@FPR=1%  FPR@TPR=50%
90.38%  66.6%       0.6%

Table 5: Most Informative Features

Top 5 Features
1  Posting Months
2  Posting Weeks
3  Std-Dev. of Image Frequency
4  Norm. No. of Names
5  Norm. No. of Unique Images


(a) This pair of ads has extremely similar textual content, including the use of non-Latin and special characters. The ads also advertise the same individual, as strongly evidenced by the common alias, 'Paris'.

(b) The first ad here does not include any specific names of individuals. However, the strong textual similarity with the second ad and the same advertised cost help to match them and discover the individuals being advertised as 'Nick' and 'Victoria'.

(c) While this pair is not extremely similar in terms of language, the existence of the rare alias 'SierraDayna' in both advertisements helps the classifier in matching them. The match can also easily be verified by the similar language structure of the pair.

(d) The first advertisement represents the entities 'Black China' and 'Star Quality', while the second advertisement reveals that the pictures used in the first advertisement are not original and belong to the author of the second ad. This example pair shows the robustness of our match function. It also reveals how complicated the relationships between various ads can be.

Figure 8: Representative results of advertisement pairs matched by our classifier. In all four cases the advertisement pairs had no phone number information (strong feature) with which to detect connections. Note that sensitive elements have been intentionally obfuscated.

Figure 11: Representative entity isolated by our pipeline, believed to be involved in human trafficking. The nodes represent advertisements, labeled with aliases and locations (e.g. Jessica, Montgomery, AL; Kimberly, Montgomery, AL / Wilkes-Barre, PA; Monica, Scranton, PA), while the edges represent links between advertisements. This entity has 802 nodes and 39,383 edges. The visualization is generated using Gephi (Bastian et al., 2009). This entity operated in cities across states and advertised multiple different individuals along with multiple phone numbers. This suggests a more complicated and organised activity, and serves as an example of how complicated certain entities can be in this trade.

5 Conclusion

In this paper we approached the problem of isolating sources of human trafficking in online escort advertisements with a pairwise Entity Resolution approach. We trained a classifier able to predict whether two advertisements are from the same source, using phone numbers as a strong feature and exploiting them as proxy ground truth to generate training data for the classifier. The resultant classifier proved to be robust, as evidenced by extremely low false positive rates. Other approaches (Szekely et al., 2015) aim to build similar knowledge graphs using a similarity score between each feature. This has some limitations. Firstly, labelled training data is needed in order to train match functions to detect ontological relations. The challenge is aggravated since this approach considers each feature independently, making the generation of enough labelled training data for training multiple match functions an extremely complicated task.


Since we utilise existing features as proxy evidence, our approach can generate a large amount of training data without the need for any human annotation. Our approach requires learning just a single function over the entire feature set; hence our classifier can learn multiple complicated relations between features to predict a match, instead of making the naive feature independence assumption.

We then proceeded to use this classifier to perform entity resolution, using a heuristically learned value of the classifier score as the match threshold. The resultant connected components were again featurized, and a classifier model was fit before subjecting them to rule learning. In comparison with (Dubrawski et al., 2015), the connected component classifier performs a little better, with higher values of the area under the ROC curve and of TPR@FPR=1%, indicating a steeper ROC curve. We hypothesize that, owing to the entity resolution process, we are able to generate a larger, more robust amount of training data that is immune to noise in the labelling, resulting in a stronger classifier. The learnt rules show high ratios and lift for reasonably high support, as evidenced in Table 3. Rule learning also adds an element of interpretability to the models we built; compared to more complex ensemble methods like Random Forests, hard rules as classification models are preferred by domain experts to build evidence for incrimination.

6 Future Work

While our blocking scheme performs well in reducing the number of comparisons, our approach still involves naive pairwise comparisons, so scalability remains a significant challenge. One approach could be to design such a pipeline in a distributed environment. Another could be to use a computationally inexpensive technique to de-duplicate the dataset of near-duplicate ads, which would greatly help with regard to scalability.

In our approach, the ER process depends upon the heuristically learnt match threshold. Lower threshold values can significantly degrade performance, producing extremely large connected components. Treating this threshold as a learning task would help make this approach more generic and less domain-specific.

Hashcodes of the images associated with the ads were also utilized as a feature for the match function. However, simple features like the number of unique and common images did not prove to be very informative. Further research is required in order to make better use of such visual data.

Acknowledgments

The authors would like to thank all staff, faculty and students who made the Robotics Institute Summer Scholars program 2015 at Carnegie Mellon University possible.

References

Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. 2009. Gephi: An open source software for exploring and manipulating networks. http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154.

Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. 2009. Swoosh: a generic approach to entity resolution. The VLDB Journal: The International Journal on Very Large Data Bases 18(1):255–276.

Hamish Cunningham. 2002. GATE, a general architecture for text engineering. Computers and the Humanities 36(2):223–254.

Artur Dubrawski, Kyle Miller, Matthew Barnes, Benedikt Boecking, and Emily Kennedy. 2015. Leveraging publicly available data to discern patterns of human-trafficking activity. Journal of Human Trafficking 1(1):65–85.

Emily Kennedy. 2012. Predictive patterns of sex trafficking online.

Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Edouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830. http://dl.acm.org/citation.cfm?id=1953048.2078195.

Pedro Szekely, Craig A. Knoblock, Jason Slepicka, Andrew Philpot, Amandeep Singh, Chengye Yin, Dipsy Kapoor, Prem Natarajan, Daniel Marcu, Kevin Knight, David Stallard, Subessware S. Karunamoorthy, Rajagopal Bojanapalli, Steven Minton, Brian Amanatullah, Todd Hughes, Mike Tamayo, David Flynt, Rachel Artiss, Shih-Fu Chang, Tao Chen, Gerald Hiebel, and Lidia Ferreira. 2015. Building and using a knowledge graph to combat human trafficking. In Proceedings of the 14th International Semantic Web Conference (ISWC 2015).

Sriharsha Veeramachaneni and Ravi Kumar Kondadadi. 2009. Surrogate learning: from feature independence to semi-supervised classification. In Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing. Association for Computational Linguistics, pages 10–18.

