Forensic Collection of Websites

I have been asked to capture an entire website. A couple of single pages would have been no problem, but I am not sure how to forensically capture an entire site (so many pages) in a defensible way.

Does anyone have suggestions on which tools to use and how to capture a website with all of its pages?

This is a question I get quite often. So, I will jump in and make a few suggestions to get the ball rolling :smiley:

Forensic Acquisition of Websites (FAW)

FAW is a commercial tool from Italy that goes into great depth in this area. I recommend reviewing their features overview and user manual to learn more about the tool's capabilities, such as its Wireshark integration :+1:t2:

HTTrack

HTTrack is an excellent website-mirroring tool with countless options. You can find Fred Cohen's comprehensive guide on its usage here:

Cohen offers the following snippet to achieve a “forensic dump” of a website:

 httrack "www.website.com/" -O "/tmp/www.website.com" -R5H0Ko0s0zZd %H -V "md5 \$0" "+*.website.com/*" 

I strongly recommend dissecting the details and understanding every parameter here rather than using it verbatim. For example, going off of Cohen’s suggestions:

Option            Description
O                 Specifies the output location.
R5                Sets the retry count; in this case, 5 retry attempts.
H0                Controls when to abandon a host; 0 indicates never. You may
                  want to consider this carefully!
K                 Keeps original links.
o0                Disables generating an output HTML file in case of error.
s0                Never follows robots.txt and meta robots tags.
z                 Logs extra information.
Z                 Logs debug information.
d                 Stays on the same principal domain.
%H                Logs HTTP headers in the logfile.
V                 Executes a command after each file ($0 is the filename). In
                  this case, we run the md5 command on each file to hash it.
                  Adjust as needed (e.g., use hashdeep; see the sketch after
                  this table).
+*.website.com/*  A scan filter indicating we should stay within the principal
                  domain.
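Building on the V option above: if you prefer to hash the mirror after the fact instead of (or in addition to) hashing per file during the crawl, a minimal sketch with hashdeep might look like the following. The mirror path matches Cohen's example; the manifest filename is just an illustrative choice.

    # Create an MD5 + SHA-256 manifest of every file in the mirror
    # (-c selects algorithms, -r recurses, -l records relative paths):
    hashdeep -c md5,sha256 -r -l /tmp/www.website.com > /tmp/website-manifest.txt

    # Later, audit the mirror against the manifest to demonstrate integrity
    # (-a enables audit mode, -k supplies the known-hashes file):
    hashdeep -c md5,sha256 -r -l -a -k /tmp/website-manifest.txt /tmp/www.website.com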

Using the e option rather than d tells the tool to go everywhere on the web as needed. If you choose that option, I would recommend setting a depth limit, as sketched below.
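For example, here is a minimal, untested variation on Cohen's command: the trailing travel-mode letter changes from d to e, -r3 caps the mirror depth at three links, and the filter is widened to +* since the crawl is no longer confined to one domain. Treat the depth and filter values as placeholders to tune for your matter.

    httrack "www.website.com/" -O "/tmp/www.website.com" -R5H0Ko0s0zZe -r3 -%H -V "md5 \$0" "+*"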

Also, be mindful of any differences between the Linux and Windows versions of HTTrack (i.e., HTTrack v. WinHTTrack).

MAGNET Web Page Saver

WPS is a free tool that takes a list of URLs and acquires scrolling snapshots of each page in formats such as PNG and PDF. It is very handy for capturing how a web page looked at a certain point in time.

MAGNET Web Page Saver - Magnet Forensics

Hunchly

Hunchly has an OSINT focus and allows you to build a case as you visit web pages. I would recommend taking a look if you plan to preserve web pages as you visit them, rather than batch-preserve an entire website.

https://www.hunch.ly

Trusted Timestamping

Trusted timestamping is typically an important part of the forensic capture of websites, as the goal is often to memorialize what a website contained, or did not contain, at a specific point in time. We have some information on how to accomplish this with open-source tools here:
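As a quick illustration, one common open-source approach uses OpenSSL's RFC 3161 tooling. The sketch below assumes freetsa.org as an example time-stamping authority; substitute your preferred TSA, and note the file names are illustrative (the manifest comes from the hashdeep example above).

    # Build an RFC 3161 timestamp query over the evidence (here, the hash manifest):
    openssl ts -query -data /tmp/website-manifest.txt -sha256 -cert -out capture.tsq

    # Submit the query to the TSA (freetsa.org is one public example):
    curl -s -H "Content-Type: application/timestamp-query" \
         --data-binary @capture.tsq https://freetsa.org/tsr -o capture.tsr

    # Inspect the signed timestamp token:
    openssl ts -reply -in capture.tsr -text

    # Verify the token against the data; the CA bundle must be obtained from your TSA:
    openssl ts -verify -data /tmp/website-manifest.txt -in capture.tsr -CAfile tsa-cacert.pem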

I hope this helps get you started. Looking forward to hearing suggestions from others as well :blush:


Nice topic; I'm particularly interested in this field. To whom it may concern: in December 2021 I gave a presentation at the OSDFCon conference, "Forensic Acquisition of Websites, Webpages and Online Services with Open Source Tools," in which I explain how to perform a forensic acquisition of websites and webpages for free with open-source tools (also comparing some state-of-the-art tools and services, which, by the way, I currently use).

Slides here: Forensic Acquisition of Websites, Webpages and Online Services with Open Source Tools - OSDFCon

Video here:

Any other tools, protocols, or methods are welcome. I'm doing a lot of research comparing tools, service providers, and protocols, and I'll soon publish a write-up of my findings.

Paolo Dal Checco (Forenser Srl)


Paolo,

This looks wonderful. Many thanks for sharing, and I'm looking forward to your write-ups! :+1:t2: