PREDICT Project Overview
Researchers need current data on Internet security threats to develop
hardware and software that protect against and mitigate the effects of
hacking attempts and malicious software. Such data include samples of
normal and malicious Internet traffic, malicious software samples, and
logs from machines compromised in targeted attacks. Concerns over
privacy, security, proprietary
information, and legal risks make collection and distribution of
such data difficult for infrastructure owners and for those who own,
collect, or distribute the data. Thus, few
organizations make datasets available for the development and testing
of defensive technologies.
The Department of Homeland Security (DHS) has developed the
Protected Repository for the Defense of Infrastructure Against Cyber Threats (PREDICT) project to provide vetted researchers with current network
operational data in a secure and controlled manner that respects
the security, privacy, legal, and economic concerns of Internet users and
network operators. You can learn more about PREDICT in the Overview of the PREDICT program (DHS.gov PDF document).
DHS established the PREDICT project to meet three primary goals:
- To provide vetted researchers with a Web-based portal that catalogs current computer network and operational
data and provides infrastructure for requesting data.
- To provide secure access to multiple sources of data collected
from Internet use and traffic.
- To facilitate data flow among PREDICT participants for the purpose
of developing new models, technologies, and products that support
effective threat assessment and increase cyber security capabilities.
CAIDA's Role in the PREDICT Project
CAIDA has been involved with the development of the PREDICT program
since its inception; CAIDA personnel have served in an advisory
capacity on all committees developing and implementing PREDICT
processes and procedures. CAIDA participates in the PREDICT program
as a Data Provider, collecting routing data, peering point passive
traces, and denial-of-service attack and Internet worm data from the
UCSD Network Telescope. CAIDA is also a Data Host, serving that
data to researchers who have been vetted and approved through the
PREDICT program. Because of its Data Host and Data Provider roles, a
CAIDA representative will serve on the PREDICT Application Review
and Publication Review Boards when they consider data that CAIDA
collects or distributes.
Major project activities include:
- collecting, documenting, anonymizing, and distributing routing, peering point, and UCSD Network Telescope data,
- continuing to advise on technical, legal, and practical aspects of PREDICT policies and procedures, and
- creating an index of anonymization techniques and the advantages and pitfalls of applying them to Internet datasets.
Data Sets
CAIDA released the following datasets with support from DHS.
The creation of some of these datasets was cost-shared with
other federal agency and private sector funding sources.
- Backscatter-TOCS Dataset (November 5, 2005)
http://www.caida.org/data/passive/backscatter_tocs_dataset.xml
- Witty Worm Dataset (January 11, 2006)
http://www.caida.org/data/passive/witty_worm_dataset.xml
- Code-Red Worms Dataset (February 1, 2006)
http://www.caida.org/data/passive/codered_worms_dataset.xml
- Backscatter 2004-2005 Dataset (February 15, 2006)
http://www.caida.org/data/passive/backscatter_2004_2005_dataset.xml
- Backscatter-2006 Dataset (April 7, 2006)
http://www.caida.org/data/passive/backscatter_2006_dataset.xml
- AS Taxonomy Repository (April 14, 2006)
http://www.caida.org/data/active/as_taxonomy/
- DNS Root Server/gTLD RTT Dataset (August 3, 2006)
http://www.caida.org/data/passive/dns_root_gtld_rtt_dataset.xml
- Backscatter-2007 Dataset (June 28, 2007)
http://www.caida.org/data/passive/backscatter_2007_dataset.xml
- Anonymized 2007 Internet Traces Dataset (January 15, 2008)
http://www.caida.org/data/passive/passive_2007_dataset.xml
- IPv4 Routed /24 Topology Dataset (February 1, 2008)
http://www.caida.org/data/active/ipv4_routed_24_topology_dataset.xml
- Backscatter-2008 Dataset (March 26, 2008)
http://www.caida.org/data/passive/backscatter_2008_dataset.xml
- IPv4 Routed /24 AS Links Dataset (March 31, 2008)
http://www.caida.org/data/active/ipv4_routed_topology_aslinks_dataset.xml
- Anonymized 2008 Internet Traces Dataset (June 6, 2008)
http://www.caida.org/data/passive/passive_2008_dataset.xml
We developed the CAIDA Data FAQ to address questions from researchers.
We also provide information on data sharing and anonymization, including a list of relevant papers.
Previous Funding
Previous PREDICT-related efforts were funded under DHS contract
NBCHC040159: "Network Traffic Data Repository to Develop Secure IT
Infrastructure."