CyberScan (Online IP Infringement Detection Service)
July 2011
CyberScan (Online IP Infringement Detection Service) | July 2011
© 2011, HCL Technologies. Reproduction Prohibited. This document is protected under Copyright by the Author, all rights reserved.
TABLE OF CONTENTS
Abstract
Abbreviations
The Problem
Business and Technical Challenges
CyberScan Solution
Key Features
Key Capabilities
How CyberScan Works
Business Impact Examples
Conclusion
Author Info
Abstract
Media and content providers are among the most profitable businesses
on the internet today, and the biggest threat to their profits is
piracy of their copyrighted products. Companies, especially in the
entertainment, software, and publishing industries, continue to lose
profits as pirated content proliferates illegally across a vast
number of sites.
HCL, a leading global IT services company, has now tapped its
proprietary skills and tools to develop a software solution that seeks
out and protects against illegal hosting or linking of material
intended for sale. CyberScan turns online copyright infringement from
a revenue loss into automatically collected evidence for direct
action, and it uses the same stealth-like methods as the criminals do
to identify and protect against illegal postings or links to “hacked”
material.
CyberScan provides an innovative and highly effective online
copyright infringement detection service to help businesses reduce
profit loss. Its state-of-the-art software uses web crawling, tracking
and indexing, distributed agent ‘sniffers’, and IP masking to identify,
monitor, and report infringed content in a rapid but undetectable
manner. It even bypasses the typical methods piracy sites use to
hide from or fight off such detection. CyberScan employs its special
technical methods automatically, reducing the cost of manually
finding and tracking infringement or compliance.
Infringement of any IP or copyrighted content that can be sold and
shared digitally - movies, pictures, audio files, eBooks, TV shows,
documents, and software - can now be identified and addressed
with CyberScan. Your business can finally fight online piracy of your
content in an effective manner. CyberScan from HCL is a powerful
tool and a major benefit for any business combating copyright
infringement.
Abbreviations
Sl. No. | Acronym | Full form
1 | IP | Intellectual Property
2 | URL | Uniform Resource Locator
3 | AWS | Amazon Web Services
4 | SaaS | Software as a Service
The Problem
Online piracy and infringement of copyrighted material continue to
grow as a problem, and as a source of profit loss, for businesses,
even more than retail theft in stores. The problems businesses face
today in trying to combat online piracy include the following:
- Identifying infringing content across a vast and rising number of
  hosting/linking sites, known or unknown.
- Searching, finding, and filtering through content in a timely
  manner even though it propagates quickly across the internet.
- Staffing manual operators to search for and detect infringing
  material, or hiring developers skilled in the particular logic and
  coding algorithms it requires.
- Evading techniques used by criminal website administrators, such as
  blocking IP addresses based on number of hits, to thwart manual or
  scripted detection systems.
- Getting through authentication techniques used by piracy site
  administrators to protect and firewall their illegal content.
- Issuing Cease and Desist or Takedown Notices to an ever-growing
  number of dynamic hosting sites that change their URLs and
  addresses.
- Fingerprinting infringement as evidence and enforcing compliance
  with removal after discovery or serving notice.
- Reporting meaningful data out of the huge volumes of content found,
  to derive infringement patterns, assess perpetrators, and make
  business decisions.
Meeting The Challenge
Businesses currently trying to solve these problems of content piracy and distribution face steep challenges and roadblocks, but CyberScan solves them:
Challenge | Short Description | CyberScan Solution
URL Obfuscation | Sites hide pirate links with format or layout tricks | Intelligent search expressions see through formats
Website Authentication | Sites require login or user credentials to access content | Sites are categorized and login credentials are used for automation
Infringement Detection | Rapid changes and posts make timely finding and monitoring difficult | Special search methods seek, tag, and monitor based on site
Web Crawler Obstruction | Site admins limit, watch, and block detecting programs/users | Jobs spawn across the globe from new addresses to remain in stealth
The next two sections go into further detail on these business and technical challenges, and on precisely how CyberScan provides the best solution to them available today.
Business and Technical Challenges
Trying to stop online piracy and illegal distribution of content on the
internet is nothing new. Like hiring security guards for a storefront,
combating online theft can be costly and poses unique challenges.
Further, the criminal sites respond to business attempts to find and
remove illegitimate and illegal content with increasing technical
sophistication. Not only must the sites hosting pirated material be
identified, but so must the sites that link to their hacked content.
Each of the challenges listed is described in more detail below, and
the next section discusses how CyberScan solves them:
URL Obfuscation
Websites that contain links to infringing content, from forums and
blogs to search engines, commonly use various obfuscation
techniques to prevent automated systems from detecting
infringement. Tactics of these linking sites include posting plain text
URLs instead of hyper-linked ones, replacing characters inside
URLs in a way a human can identify but not a computer, using third
party URL shortening services, and requiring registration to view
content or posts. These URL obfuscation methods are all a
challenge to a company trying to search and identify the sites that
serve as a link or entry point to pirated material.
Website Authentication
Some infringing or linking websites require registration before
content can be browsed. This closed-door firewall tactic is
particularly difficult to address because there are no standard
methods of authentication across the web. Many sites use
customized form-based authentication, which any manual or
programmed web crawler or sniffer must handle. To further
complicate matters, some linking websites allow anonymous access
to only small portions of the site, or require their own authentication
before users can view links to infringing content and downloads.
Infringing Content Detection
Identification of infringement hidden inside unstructured content is a
serious challenge, especially given the dynamic linking nature of the
Internet and the frequency of new or updated posts. The ability to
detect infringing content within hours of its being posted is a
desired capability of any detection system. Developing a specialized
approach to content detection and crawling that not only seeks and
finds, but also continually monitors any type of website, is a great
challenge that must be met to keep pirated content from spreading.
Webcrawler Obstruction
Some linking sites take proactive and even reactive measures to
hinder automated systems and manual sniffing for pirated content.
Tactics include blocking IP addresses according to their own
criteria, blocking particular user agent strings, and enforcing page
view limits and quotas. These obstructions may occur programmatically
or via manual intervention by the website administrators. A great
challenge in crawling the web manually or automatically is to remain
in stealth mode, so you can continue to detect and monitor IP
infringement while remaining undetected yourself.
Developed with these problems in mind, CyberScan provides key
logic and proven components, allowing automated solutions to
these unique challenges and more.
CyberScan Solution
CyberScan directly addresses the business and technical challenges in the piracy prevention sphere outlined above in the following specific ways:
URL Obfuscation
Infringing sites use cover methods like changing or masking their
URLs, clouding the links to their site, or requiring registration to
continue.
CyberScan’s custom web crawler logic intelligently applies regular
expressions to detect and process host site URLs inside unstructured
web content. CyberScan also supports custom website authentication
mechanisms, which enable crawling entire domains under the guise of a
registered user. This combination effectively deals with most forms
of URL obfuscation.
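As a rough illustration of applying regular expressions to URLs hidden in unstructured text, the sketch below normalises a few common disguises and then extracts plain-text links. It is written in Python for brevity and is not CyberScan's actual Java/Nutch logic; the substitution rules, the function name extract_urls, and the example hosts are all assumptions made for illustration. Shortened URLs would still need to be resolved separately.

```python
import re

# Hypothetical substitutions for common human-readable disguises.
_DEOBFUSCATIONS = [
    (re.compile(r"hxxp", re.IGNORECASE), "http"),
    (re.compile(r"\s*[\[\(]\s*(?:dot|\.)\s*[\]\)]\s*", re.IGNORECASE), "."),
    (re.compile(r"\s*[\[\(]\s*(?:colon|:)\s*[\]\)]\s*", re.IGNORECASE), ":"),
]

# Matches plain-text URLs, whether or not they were hyperlinked.
_URL = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_urls(text):
    """Return candidate URLs found in unstructured text after normalising it."""
    for pattern, replacement in _DEOBFUSCATIONS:
        text = pattern.sub(replacement, text)
    # Strip trailing punctuation that usually belongs to the sentence.
    return [url.rstrip(".,;:!?)") for url in _URL.findall(text)]


print(extract_urls("Get it here: hxxp://files[dot]example[.]com/movie.avi, enjoy"))
```

Each rule is kept separate so that new disguises can be added without touching the extraction step.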
Website Authentication
Sites hosting pirated content often require user authentication,
keeping their illegal wares behind a closed and locked door.
Using sophisticated analysis algorithms, CyberScan classifies linking
sites based on a wide range of criteria so that the best applicable
approach is selected. The Nutch web crawler’s authentication modules
have been extended to support form-based authentication. Credentials
gathered from manual site registration are supplied through the
CyberScan Web Application and are used while crawl jobs are underway.
This allows CyberScan web crawlers to access and analyze areas of
suspect websites that typical search engines are unable to reach.
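Because form-based authentication differs from site to site, one way to picture the idea is a per-site description of the login form that is turned into a POST request. The sketch below is in Python for brevity and does not represent the extended Nutch modules; the SITE_LOGINS table, its field names, the host, and the credentials are hypothetical. The request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical per-site login descriptions; every name and URL is illustrative.
SITE_LOGINS = {
    "forum.example.com": {
        "login_url": "https://forum.example.com/login.php",
        "user_field": "username",
        "pass_field": "password",
    },
}


def build_login_request(host, username, password):
    """Build (but do not send) the POST request for a site's custom login form."""
    site = SITE_LOGINS[host]
    form = urlencode({site["user_field"]: username,
                      site["pass_field"]: password}).encode("utf-8")
    return Request(site["login_url"], data=form,
                   headers={"Content-Type": "application/x-www-form-urlencoded"})


request = build_login_request("forum.example.com", "alice", "s3cret")
print(request.get_method(), request.full_url)
```

In a real crawl the request would be sent through an opener that keeps cookies, for example one built with urllib.request.build_opener and HTTPCookieProcessor, so that the session persists across page fetches.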
Infringing Content Detection
Infringing content can be hidden and changed by new posts,
updates, propagation, and by the fast dynamic nature of the internet.
CyberScan uses a combination of weighted regular expressions to
detect infringing content. Within a website known to serve content
suspected of infringing, the program is stricter in determining
infringement possibilities. Within less known or new sites, search
logic can also be applied. If, for example, the body of a post
matches a regular expression designed to detect a customer’s content
title, and the URL also contains a particular flagged string, the
code can accurately determine whether the post is infringing.
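The weighted approach described above can be sketched as a set of rules, each contributing a weight when its pattern matches, with a threshold deciding whether a post is flagged. This is a minimal Python illustration under stated assumptions, not CyberScan's implementation: the Rule structure, the integer weights, the example title "Example Movie", and the idea of passing a per-site threshold are all hypothetical, and how CyberScan actually varies strictness between sites is not described in this document.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    field: str    # which part of the post to test: "body" or "url"
    pattern: str  # regular expression, matched case-insensitively
    weight: int   # contribution to the score when the pattern matches


def infringement_score(post, rules):
    """Sum the weights of all rules whose pattern matches its field."""
    return sum(
        rule.weight
        for rule in rules
        if re.search(rule.pattern, post.get(rule.field, ""), re.IGNORECASE)
    )


def is_suspect(post, rules, threshold):
    """Flag a post when its score reaches the threshold chosen for the site."""
    return infringement_score(post, rules) >= threshold


# Illustrative rules for a made-up title, "Example Movie".
RULES = [
    Rule("body", r"example[\s._-]*movie", 6),
    Rule("url", r"download|mirror", 4),
]

post = {"body": "Example.Movie.2011.DVDRip",
        "url": "http://host.example/download/123"}
print(infringement_score(post, RULES), is_suspect(post, RULES, threshold=10))
```

With a threshold of 10, both the title match and the flagged URL string are required; a lower threshold would flag a post on the title match alone.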
CyberScan offers a comprehensive solution to online infringement by effectively leveraging best-of-breed open source technologies and the power of cloud computing.
Different crawling methodologies have been implemented depending on
the layout of sites. For example, on forum-style websites CyberScan
attempts to use the site’s search functionality to sniff and crawl
links that have a high probability of infringement. On other styles
of site, where search is not available, CyberScan crawls the index
pages and analyzes each post according to its logic. CyberScan will
find infringement when it is there.
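The choice between the two crawling methodologies can be pictured as a small dispatch on the site's layout. The Python sketch below is only an illustration: the site dictionary, its keys (search_url, index_url, index_pages), and the query parameter names q and page are hypothetical, since real sites each use their own parameters.

```python
from urllib.parse import urlencode


def seed_urls(site, titles):
    """Choose starting URLs for a crawl based on the site's layout."""
    if site.get("search_url"):
        # Forum-style site: query its own search for each protected title.
        return [f'{site["search_url"]}?{urlencode({"q": title})}'
                for title in titles]
    # No search available: fall back to walking the index pages.
    return [f'{site["index_url"]}?page={number}'
            for number in range(1, site.get("index_pages", 1) + 1)]


forum = {"search_url": "https://forum.example.com/search"}
blog = {"index_url": "https://blog.example.com/archive", "index_pages": 2}
print(seed_urls(forum, ["Example Movie"]))
print(seed_urls(blog, ["Example Movie"]))
```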
Webcrawler Obstruction
Pirate site administrators react and try to block access or views by
legitimate enforcers, either manually or with programs that detect
who is trying to detect them.
CyberScan conducts its crawls inside Amazon's Elastic MapReduce
service on AWS. Each crawl job runs in a newly provisioned cluster,
each using a different IP address and geographical location. The user
agent string is set to match the most common browsers/platforms on
the web. To avoid tripping server-side page view limits or quotas,
the client crawl jobs are configured to be low impact and “polite” to
the web servers. By crawling the targeted sites in large but
distributed jobs, the load is spread across the entire World Wide
Web, while the system actively searches for infringement using
varying aliases. This combination helps to keep the web crawlers
under the radar of website administrators, and makes CyberScan very
difficult to identify and block.
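One common way to make a crawler “polite” is to enforce a minimum delay between requests to the same host. The Python sketch below shows that idea only; it is not how Nutch or CyberScan implements it, and the class name PoliteScheduler and the 5-second default are assumptions. The caller supplies the current time, so the example runs without sleeping.

```python
from urllib.parse import urlsplit


class PoliteScheduler:
    """Track, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay_seconds=5.0):
        self.min_delay = min_delay_seconds
        self._next_allowed = {}  # host -> earliest allowed fetch time

    def wait_time(self, url, now):
        """Return how long to wait before fetching url, and reserve the slot."""
        host = urlsplit(url).hostname
        start = max(now, self._next_allowed.get(host, now))
        self._next_allowed[host] = start + self.min_delay
        return start - now


scheduler = PoliteScheduler(min_delay_seconds=5.0)
print(scheduler.wait_time("http://a.example/1", now=100.0))
print(scheduler.wait_time("http://a.example/2", now=101.0))
print(scheduler.wait_time("http://b.example/1", now=101.0))
```

Requests to different hosts do not delay one another, which is what allows a large distributed job to stay low impact on each individual server.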
Key Features
CyberScan finds pirated content and the sites that provide paths to
it in a way no other software can – effectively, secretly, and
automatically. CyberScan’s key features and benefits include the
following:
- Automatic identification of suspected infringement of intellectual
  property
- Dynamic detection through multiple geographies to remain in
  “stealth mode”
- Savvy “crawl/sniff” logic that remains undetected by pirate site
  administrators
- Full evidence capture and archival for legal establishment of
  infringement
- Thorough and fully automated domain traversal, parsing, and
  indexing
- Powerful multi-faceted search for drilling into indexed content
- Adaptive tracking of detected sites to ensure removal and
  compliance
- Live feeds detailing newly discovered infringement
- Cross-category crawls of sites and specific sniffing posed as a
  legitimate user
- Interactive web interface for system monitoring and control
- Prevalence analysis and reporting of pirated content and its
  service providers
- Highly scalable and reliable cloud architecture using proven open
  source modules
CyberScan’s rich feature set and capabilities are built around four parts (Identification, Evidence Collection, Infringement Reporting, and Re-verification) and provide 360° protection for your content.
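To make the evidence collection step concrete, the sketch below records a captured page together with a timestamp and a SHA-256 digest of its content, so that a later comparison can show whether the page changed or was removed. This is a Python illustration under assumptions: the function name evidence_record and the record fields are hypothetical, and the document does not describe CyberScan's actual evidence format.

```python
import hashlib
from datetime import datetime, timezone


def evidence_record(url, content, captured_at=None):
    """Describe one captured page so later change or removal is detectable."""
    captured_at = captured_at or datetime.now(timezone.utc)
    return {
        "url": url,
        "captured_at": captured_at.isoformat(),
        "sha256": hashlib.sha256(content).hexdigest(),
        "size_bytes": len(content),
    }


record = evidence_record("http://host.example/post/1", b"<html>post</html>")
print(record["url"], record["size_bytes"], record["sha256"][:12])
```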
Key Capabilities
CyberScan was developed by HCL experts to include key
capabilities and utilize proven components to specifically address
content piracy concerns of provider businesses and their technical
staff. The following are some highlights:
- Customized proprietary version of the Nutch open source web
  crawler
- Proven cloud infrastructure utilizing Amazon Web Services (AWS) to
  deploy/run
- Advanced AWS services like elastic clusters for a highly scalable
  and reliable system
- Dynamic resource allocation across multiple domains, locations, and
  user strings, enabling CyberScan to work in an undetected stealth
  mode
- Fully indexed suspicious domain lists and multi-faceted search
  results through a custom search engine UI, useful for research,
  analysis, and reporting
- Adaptive revisits to suspicious content download and link pages, to
  detect when they are removed and ensure compliance
- Coding logic that ensures “politeness” to servers being sniffed, so
  that web crawlers resemble normal users and go unnoticed by
  reactive pirate site administrators
- Cloud-based architecture that allows for global efficiency and a
  Software as a Service (SaaS) pay-per-use billing model
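The adaptive revisits in the list above could follow many policies; the document does not say which one CyberScan uses. The Python sketch below shows one plausible policy purely as an assumption: pages still serving the content are rechecked at a fixed base interval, while pages where the content has gone are rechecked less and less often, up to a cap, so that a re-upload is eventually noticed. The function name and the default intervals are hypothetical.

```python
def next_revisit_hours(still_online, previous_hours, base_hours=6, max_hours=168):
    """Pick the delay before the next check of a reported page."""
    if still_online:
        # Content not yet removed: keep checking at the base rate.
        return base_hours
    # Content gone: check less often, but never stop entirely, so that
    # a re-upload at the same address is eventually noticed.
    return min(previous_hours * 2, max_hours)


print(next_revisit_hours(True, 48))
print(next_revisit_hours(False, 6))
print(next_revisit_hours(False, 100))
```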
The CyberScan Difference
CyberScan’s solution stands out from any mix of software for its
features and capabilities; the following are some of the key benefits
HCL adds for businesses that partner with it to use the CyberScan
solution and services:
How CyberScan Works
The secret to CyberScan’s profit-saving features and benefits lies in
HCL’s selection and customization of technologies that together
perform the job of quietly and efficiently detecting copyright
infringement and propagation. HCL found niche open source computing
platforms and customized them, added an intelligent architecture
geared for IP detection tasks, and tapped the power of the cloud. The
result is the differentiating feature set outlined above.
The following are some of the technologies and components used in
this unique HCL assembly and coding:
- Java – programming language and computing platform
- Nutch – a multi-threaded web crawler capable of full web-scale
  indexing, serving as the core sniffer/crawling technology
- SOLR – an enterprise-grade search platform that constructs a
  full-text searchable index of the content crawled by Nutch
- Lucene – text search engine library, used by Nutch and SOLR
- Hadoop – a powerful distributed computing framework that breaks
  large computational jobs into manageable fragments to be run in
  parallel on many servers
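The way Hadoop splits a job into fragments can be illustrated with the map and reduce steps run in a single process. The Python sketch below counts suspect links per host across crawled pages; it is a toy model of the MapReduce pattern, not Hadoop's API, and the page structure and function names are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def map_page(page):
    """Map step: emit (host, 1) for every suspect link found on one page."""
    return [(urlsplit(link).hostname, 1) for link in page["suspect_links"]]


def reduce_counts(pairs):
    """Reduce step: total the emitted counts per host."""
    totals = Counter()
    for host, count in pairs:
        totals[host] += count
    return dict(totals)


def run_job(pages):
    """Run map then reduce in-process; Hadoop would distribute both steps."""
    pairs = [pair for page in pages for pair in map_page(page)]
    return reduce_counts(pairs)


pages = [
    {"suspect_links": ["http://a.example/f1", "http://b.example/f2"]},
    {"suspect_links": ["http://a.example/f3"]},
]
print(run_job(pages))
```

Because each page is mapped independently, the map step can run on many servers at once, with only the per-host totals brought together at the end.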
The following diagram further depicts some of the back-end
components (the Hadoop layer) of CyberScan’s architecture.
[Diagram: back-end components (the Hadoop layer) of CyberScan’s architecture]
Business Impact Examples
The innovation behind this powerful new tool has already led to the
following business impact for beta-testing and initial customers.
These are listed only as examples of how your business can achieve
similar success:
- A customer has realized more than 95% infringement detection
  accuracy.
- A customer realized a 30% cost saving compared to its existing mix
  of service providers, with the added benefit of a single source,
  HCL, for the new provisions.
- Customers express eagerness about a pay-per-use scheme, allowing
  them to worry only about their business while HCL takes care of the
  engineering, technology innovation, maintenance, support, and
  research.
- A customer division, based on its resounding success with
  CyberScan, is now introducing the solution to all Business Units
  and select partners of its company.
Conclusion
HCL CyberScan can help any business protect its key intellectual
property. As online piracy continues to grow exponentially,
companies must remain vigilant with technology to minimize
copyright infringement and its resulting profit loss. Our unique
solution is a new and effective way to combat online piracy, IP theft,
and illegal distribution by using automation and the latest
internet/cloud technologies.
Let CyberScan stop piracy and secure your profits.
For more on how HCL CyberScan can benefit your organization,
contact us at cyberscan@hcl.com
Author Info
Michael Grucz
Technical Lead Research,
Internet Security Business Unit
Kiran Kumar Reddy . V
Product Manager,
Internet Security Business Unit.
CyberScan is an effective and cost-efficient solution to combat online piracy and reduce your profit loss.
Hello, I’m from HCL’s Engineering and R&D Services. We enable technology led organizations to go to market with innovative products & solutions. We partner with our customers in building world class products & creating the associated solution delivery ecosystem to help build market leadership. Right now, 14500+ of us are developing engineering products, solutions and platforms across Aerospace and Defense, Automotive, Consumer Electronics, Industrial Manufacturing, Medical Devices, Networking & Telecom, Office Automation, Semiconductor, Servers & Storage for our customers.
For more details contact eootb@hcl.com
Follow us on twitter http://twitter.com/hclers and our blog http://ers.hclblogs.com/
Visit our website http://www.hcltech.com/engineering-services/
About HCL
About HCL Technologies
HCL Technologies is a leading global IT services company, working with clients in the areas that impact and redefine the core of their businesses. Since its inception into the global landscape after its IPO in 1999, HCL has focused on ‘transformational outsourcing’, underlined by innovation and value creation, and offers an integrated portfolio of services including software-led IT solutions, remote infrastructure management, engineering and R&D services and BPO. HCL leverages its extensive global offshore infrastructure and network of offices in 26 countries to provide holistic, multi-service delivery in key industry verticals including Financial Services, Manufacturing, Consumer Services, Public Services and Healthcare. HCL takes pride in its philosophy of ‘Employee First’, which empowers our 72,267 transformers to create real value for the customers. HCL Technologies, along with its subsidiaries, had consolidated revenues of US$ 3.1 billion (Rs. 14,101 crores), as on 31st December 2010 (on LTM basis). For more information, please visit www.hcltech.com
About HCL Enterprise
HCL is a $5.9 billion leading global technology and IT enterprise comprising two companies listed in India - HCL Technologies and HCL Infosystems. Founded in 1976, HCL is one of India's original IT garage start-ups. A pioneer of modern computing, HCL is a global transformational enterprise today. Its range of offerings includes product engineering, custom & package applications, BPO, IT infrastructure services, IT hardware, systems integration, and distribution of information and communications technology (ICT) products across a wide range of focused industry verticals. The HCL team consists of over 80,000 professionals of diverse nationalities, who operate from 31 countries including over 500 points of presence in India. HCL has partnerships with several leading Global 1000 firms, including leading IT and technology firms. For more information, please visit www.hcl.com