Forensic Collection of Websites

I have been asked to capture an entire website. A couple of single pages would have been no problem, but I am not sure how to forensically capture an entire site (so many pages) in a defensible way.

Does anyone have suggestions on which tools to use and how to capture a website with all of its pages?

This is a question I get quite often. So, I will jump in and make a few suggestions to get the ball rolling :smiley:

Forensic Acquisition of Websites (FAW)

FAW is a commercial tool from Italy that goes into great depth in this area. I recommend reviewing their features overview and user manual to learn more about the tool's capabilities, such as its Wireshark integration :+1:t2:

HTTrack

HTTrack is an excellent website-mirroring tool with countless options. You can find Fred Cohen's comprehensive guide on its usage here:

Cohen offers the following snippet to achieve a “forensic dump” of a website:

 httrack "www.website.com/" -O "/tmp/www.website.com" -R5H0Ko0s0zZd %H -V "md5 \$0" "+*.website.com/*" 

I strongly recommend dissecting the details and understanding every parameter here rather than using it verbatim. For example, going off of Cohen’s suggestions:

Option            Description
O                 Specifies the output location.
R5                Sets the retry count; in this case, 5 retry attempts.
H0                Controls when to abandon a host; 0 indicates never. You may
                  want to consider this carefully!
K                 Keeps original links.
o0                Disables generating an output HTML file in case of error.
s0                Never follows robots.txt and meta robots tags.
z                 Logs extra information.
Z                 Logs debug information.
d                 Stays on the same principal domain.
%H                Logs HTTP headers in the logfile.
V                 Executes a command after each file ($0 is the filename). In
                  this case, we run the md5 command on each file to hash it.
                  Adjust as needed (e.g., use hashdeep; see the sketch after
                  this table).
+*.website.com/*  A scan filter indicating we should stay within the principal
                  domain.
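Building on the V option above: if you prefer to hash the mirror after the fact instead of (or in addition to) hashing per file during the crawl, a minimal sketch with hashdeep might look like the following. The mirror path matches Cohen's example; the manifest filename is just an illustrative choice.

    # Create an MD5 + SHA-256 manifest of every file in the mirror
    # (-c selects algorithms, -r recurses, -l records relative paths):
    hashdeep -c md5,sha256 -r -l /tmp/www.website.com > /tmp/website-manifest.txt

    # Later, audit the mirror against the manifest to demonstrate integrity
    # (-a enables audit mode, -k supplies the known-hashes file):
    hashdeep -c md5,sha256 -r -l -a -k /tmp/website-manifest.txt /tmp/www.website.com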

Using the e option rather than d tells the tool to go everywhere on the web as needed. If you choose that option, I would recommend setting a depth limit, as sketched below.
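For example, here is a minimal, untested variation on Cohen's command: the trailing travel-mode letter changes from d to e, -r3 caps the mirror depth at three links, and the filter is widened to +* since the crawl is no longer confined to one domain. Treat the depth and filter values as placeholders to tune for your matter.

    httrack "www.website.com/" -O "/tmp/www.website.com" -R5H0Ko0s0zZe -r3 -%H -V "md5 \$0" "+*"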

Also, be mindful of any differences between the Linux and Windows versions of HTTrack (i.e., HTTrack v. WinHTTrack).

MAGNET Web Page Saver

WPS is a free tool that takes a list of URLs and acquires scrolling snapshots of each page in formats such as PNG and PDF. It is very handy for capturing how a web page looked at a certain point in time.

MAGNET Web Page Saver - Magnet Forensics

Hunchly

Hunchly has an OSINT focus and allows you to build a case as you visit web pages. I would recommend taking a look if you plan to preserve web pages as you visit them, rather than batch-preserve an entire website.

https://www.hunch.ly

Trusted Timestamping

Trusted timestamping is typically an important part of the forensic capture of websites, as the goal is often to memorialize what a website contained, or did not contain, at a specific point in time. We have some information on how to accomplish this with open-source tools here:
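As a quick illustration, one common open-source approach uses OpenSSL's RFC 3161 tooling. The sketch below assumes freetsa.org as an example time-stamping authority; substitute your preferred TSA, and note the file names are illustrative (the manifest comes from the hashdeep example above).

    # Build an RFC 3161 timestamp query over the evidence (here, the hash manifest):
    openssl ts -query -data /tmp/website-manifest.txt -sha256 -cert -out capture.tsq

    # Submit the query to the TSA (freetsa.org is one public example):
    curl -s -H "Content-Type: application/timestamp-query" \
         --data-binary @capture.tsq https://freetsa.org/tsr -o capture.tsr

    # Inspect the signed timestamp token:
    openssl ts -reply -in capture.tsr -text

    # Verify the token against the data; the CA bundle must be obtained from your TSA:
    openssl ts -verify -data /tmp/website-manifest.txt -in capture.tsr -CAfile tsa-cacert.pem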

I hope this helps get you started. Looking forward to hearing suggestions from others as well :blush:


Nice topic; I'm particularly interested in this field. To whom it may concern: in December 2021 I gave a presentation at the OSDFCon conference, "Forensic Acquisition of Websites, Webpages and Online Services with Open Source Tools," in which I explain how to perform a forensic acquisition of websites and webpages for free with open-source tools (also comparing some state-of-the-art tools and services, which, by the way, I currently use).

Slides here: Forensic Acquisition of Websites, Webpages and Online Services with Open Source Tools - OSDFCon

Video here:

Any other tools, protocols, or methods are welcome. I'm doing a lot of research comparing tools, service providers, and protocols, and I'll soon publish a write-up of my findings.

Paolo Dal Checco (Forenser Srl)


Paolo,

This looks wonderful. Many thanks for sharing, and I'm looking forward to your write-ups! :+1:t2: