I have been asked to capture an entire website. A couple of single pages would have been no problem, but I'm not sure how to forensically capture an entire site (so many pages) in a defensible way.
Does anyone have suggestions? What tools to use and how to capture a website with all of its pages?
This is a question I get quite often, so I will jump in and make a few suggestions to get the ball rolling.
Forensic Acquisition of Websites (FAW)
FAW is a commercial tool from Italy that goes into great depth in this area. I recommend reviewing their features overview and user manual to learn more about the tool's capabilities, such as its Wireshark integration.
HTTrack
HTTrack is an excellent tool for mirroring websites with countless options. You can find Fred Cohen’s comprehensive guide on its usage here:
Cohen offers the following snippet to achieve a “forensic dump” of a website:
I strongly recommend dissecting the details and understanding every parameter here rather than using it verbatim. For example, going off of Cohen’s suggestions:
| Option | Description |
| --- | --- |
| `O` | Specifies the output location. |
| `R5` | Controls the number of retries. In this case, 5 retry attempts. |
| `H0` | Controls when to abandon a host. 0 indicates never; you may want to consider this carefully! |
| `K` | Keeps original links. |
| `o0` | Disables generating an output HTML file in case of error. |
| `s0` | Never follows robots.txt and meta robots tags. |
| `z` | Logs extra information. |
| `Z` | Logs debug information. |
| `d` | Stays on the same principal domain. |
| `%H` | Logs HTTP headers in the logfile. |
| `V` | Executes a command after each file ($0 is the filename). In this case, we run the md5 command on each file to hash it. Adjust this as needed (e.g., use hashdeep). |
| `+*.website.com/*` | Indicates we should stay within the principal domain. |
Using the e option rather than d would tell the tool to go everywhere on the web as needed. If you choose that option, I would recommend setting up a depth limit.
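Putting those pieces together, a Linux invocation along these lines is a reasonable starting point. This is a sketch, not Cohen's verbatim command: the target URL, output path, and hash command (md5sum here; md5 on macOS, or hashdeep) are placeholders to adapt and test before relying on it in a real matter.

```
# Example HTTrack invocation built from the options discussed above (sketch only).
# -O     output location
# -%H    log HTTP headers
# -R5    up to 5 retries
# -H0    never abandon a slow/unresponsive host (consider carefully)
# -K     keep original links
# -o0    no error page generation
# -s0    ignore robots.txt and meta robots tags
# -z -Z  extra + debug logging
# -d     stay on the principal domain
# -V     run a command on each saved file ($0 is the filename)
httrack "https://www.website.com/" \
  -O "/cases/example/website.com" \
  -%H -R5 -H0 -K -o0 -s0 -z -Z -d \
  -V 'md5sum $0' \
  "+*.website.com/*"
```

If you use e instead of d, HTTrack's depth options (r for mirror depth and %e for external link depth) are one way to keep the crawl bounded.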
Also, be mindful of any differences between the Linux and Windows versions of HTTrack (i.e., HTTrack v. WinHTTrack).
MAGNET Web Page Saver
WPS is a free tool that takes a list of URLs and acquires scrolling snapshots of each page in formats such as PNG and PDF. It is very handy for capturing how a web page looked at a certain point in time.
Hunchly
Hunchly has an OSINT focus and allows you to build a case as you visit web pages. I would recommend taking a look if you plan to preserve web pages as you visit them, rather than batch-preserve an entire website.
Trusted timestamping is typically an important part of the forensic capture of websites, as the goal is often to memorialize what a website contained (or did not contain) at a specific point in time. We have some information on how to accomplish this with open-source tools here:
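The instructions linked above aren't reproduced here, but as one open-source illustration: if you bundle a capture into an archive (capture.zip is a placeholder name), OpenSSL's ts subcommand can obtain and verify an RFC 3161 trusted timestamp from a public timestamp authority. freetsa.org is used as an example TSA below, and cacert.pem/tsa.crt stand for its CA and TSA certificates, which you would download separately.

```
# 1. Build an RFC 3161 timestamp request over the SHA-256 hash of the archive
openssl ts -query -data capture.zip -sha256 -cert -out capture.tsq

# 2. Submit the request to a public timestamp authority (freetsa.org as an example)
curl -s -H "Content-Type: application/timestamp-query" \
     --data-binary @capture.tsq https://freetsa.org/tsr -o capture.tsr

# 3. Verify the returned token against the TSA's certificates
openssl ts -verify -in capture.tsr -queryfile capture.tsq \
     -CAfile cacert.pem -untrusted tsa.crt
```

The resulting .tsr token, kept alongside the archive and your hash logs, is independent evidence of when the capture existed in that exact state.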
I hope this helps get you started. Looking forward to hearing suggestions from others as well.
Nice topic; I'm particularly interested in this field. For anyone interested, I gave a presentation at OSDFCON in December 2021 titled "Forensic Acquisition of Websites, Webpages and Online Services with Open Source Tools," in which I explain how to perform a forensic acquisition of websites and webpages for free, with open-source tools (also comparing some state-of-the-art tools and services, which I currently use).
Any other tools, protocols, or methods are welcome. I'm doing a lot of research comparing tools, service providers, and protocols, and I will soon publish a write-up of my findings.
And if you are like me and only need to do this every so often, FAW has an On-Demand license for $275 (as of today), which works for 24 hours. If you’re in the United States, I can confirm that Teel Technologies sells this license. FAW does not sell directly, as far as I can tell.