Python Digital Network Forensics-II

Web Page Preservation with Beautiful Soup

The World Wide Web (WWW) is a unique resource of information. However, its legacy is at high risk due to the loss of content at an alarming rate. A number of cultural heritage and academic institutions, non-profit organizations and private businesses have explored the issues involved and contributed to the development of technical solutions for web archiving.

Web page preservation or web archiving is the process of gathering the data from World Wide Web, ensuring that the data is preserved in an archive and making it available for future researchers, historians and the public. Before proceeding further into the web page preservation, let us discuss some important issues related to web page preservation as given below −

Change in Web Resources − Web resources keep changing everyday which is a challenge for web page preservation.
Large Quantity of Resources − Another issue related to web page preservation is the large quantity of resources which is to be preserved.
Integrity − Web pages must be protected from unauthorized amendments, deletion or removal to protect its integrity.
Dealing with multimedia data − While preserving web pages we need to deal with multimedia data also, and these might cause issues while doing so.
Providing access − Besides preserving, the issue of providing access to web resources and dealing with issues of ownership needs to be solved too.

In this chapter, we are going to use Python library named Beautiful Soup for web page preservation.

What is Beautiful Soup?

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It can be used with urlib because it needs an input (document or url) to create a soup object, as it cannot fetch web page itself. You can learn in detail about this at www.crummy.com/software/BeautifulSoup/bs4/doc/

Note that before using it, we must install a third party library using the following command −

pip install bs4

Next, using Anaconda package manager, we can install Beautiful Soup as follows −

conda install -c anaconda beautifulsoup4

Python Script for Preserving Web Pages

The Python script for preserving web pages by using third party library called Beautiful Soup is discussed here −

First, import the required libraries as follows −

from __future__ import print_function
import argparse

from bs4 import BeautifulSoup, SoupStrainer
from datetime import datetime

import hashlib
import logging
import os
import ssl
import sys
from urllib.request import urlopen

import urllib.error
logger = logging.getLogger(__name__)

Note that this script will take two positional arguments, one is URL which is to be preserved and other is the desired output directory as shown below −

if __name__ == "__main__":
   parser = argparse.ArgumentParser('Web Page preservation')
   parser.add_argument("DOMAIN", help="Website Domain")
   parser.add_argument("OUTPUT_DIR", help="Preservation Output Directory")
   parser.add_argument("-l", help="Log file path",
   default=__file__[:-3] + ".log")
   args = parser.parse_args()

Now, setup the logging for the script by specifying a file and stream handler for being in loop and document the acquisition process as shown −

logger.setLevel(logging.DEBUG)
msg_fmt = logging.Formatter("%(asctime)-15s %(funcName)-10s""%(levelname)-8s %(message)s")
strhndl = logging.StreamHandler(sys.stderr)
strhndl.setFormatter(fmt=msg_fmt)
fhndl = logging.FileHandler(args.l, mode='a')
fhndl.setFormatter(fmt=msg_fmt)

logger.addHandler(strhndl)
logger.addHandler(fhndl)
logger.info("Starting BS Preservation")
logger.debug("Supplied arguments: {}".format(sys.argv[1:]))
logger.debug("System " + sys.platform)
logger.debug("Version " + sys.version)

Now, let us do the input validation on the desired output directory as follows −

if not os.path.exists(args.OUTPUT_DIR):
   os.makedirs(args.OUTPUT_DIR)
main(args.DOMAIN, args.OUTPUT_DIR)

Now, we will define the main() function which will extract the base name of the website by removing the unnecessary elements before the actual name along with additional validation on the input URL as follows −

def main(website, output_dir):
   base_name = website.replace("https://", "").replace("http://", "").replace("www.", "")
   link_queue = set()
   
   if "http://" not in website and "https://" not in website:
      logger.error("Exiting preservation - invalid user input: {}".format(website))
      sys.exit(1)
   logger.info("Accessing {} webpage".format(website))
   context = ssl._create_unverified_context()

Now, we need to open a connection with the URL by using urlopen() method. Let us use try-except block as follows −

try:
   index = urlopen(website, context=context).read().decode("utf-8")
except urllib.error.HTTPError as e:
   logger.error("Exiting preservation - unable to access page: {}".format(website))
   sys.exit(2)
logger.debug("Successfully accessed {}".format(website))

The next lines of code include three function as explained below −

write_output() to write the first web page to the output directory
find_links() function to identify the links on this web page
recurse_pages() function to iterate through and discover all links on the web page.

write_output(website, index, output_dir)
link_queue = find_links(base_name, index, link_queue)
logger.info("Found {} initial links on webpage".format(len(link_queue)))
recurse_pages(website, link_queue, context, output_dir)
logger.info("Completed preservation of {}".format(website))

Now, let us define write_output() method as follows −

def write_output(name, data, output_dir, counter=0):
   name = name.replace("http://", "").replace("https://", "").rstrip("//")
   directory = os.path.join(output_dir, os.path.dirname(name))
   
   if not os.path.exists(directory) and os.path.dirname(name) != "":
      os.makedirs(directory)

We need to log some details about the web page and then we log the hash of the data by using hash_data() method as follows −

logger.debug("Writing {} to {}".format(name, output_dir)) logger.debug("Data Hash: {}".format(hash_data(data)))
path = os.path.join(output_dir, name)
path = path + "_" + str(counter)
with open(path, "w") as outfile:
   outfile.write(data)
logger.debug("Output File Hash: {}".format(hash_file(path)))

Now, define hash_data() method with the help of which we read the UTF-8 encoded data and then generate the SHA-256 hash of it as follows −

def hash_data(data):
   sha256 = hashlib.sha256()
   sha256.update(data.encode("utf-8"))
   return sha256.hexdigest()
def hash_file(file):
   sha256 = hashlib.sha256()
   with open(file, "rb") as in_file:
      sha256.update(in_file.read())
return sha256.hexdigest()

Now, let us create a Beautifulsoup object out of the web page data under find_links() method as follows −

def find_links(website, page, queue):
   for link in BeautifulSoup(page, "html.parser",parse_only = SoupStrainer("a", href = True)):
      if website in link.get("href"):
         if not os.path.basename(link.get("href")).startswith("#"):
            queue.add(link.get("href"))
   return queue

Now, we need to define recurse_pages() method by providing it the inputs of the website URL, current link queue, the unverified SSL context and the output directory as follows −

def recurse_pages(website, queue, context, output_dir):
   processed = []
   counter = 0
   
   while True:
      counter += 1
      if len(processed) == len(queue):
         break
      for link in queue.copy(): if link in processed:
         continue
	   processed.append(link)
      try:
      page = urlopen(link,      context=context).read().decode("utf-8")
      except urllib.error.HTTPError as e:
         msg = "Error accessing webpage: {}".format(link)
         logger.error(msg)
         continue

Now, write the output of each web page accessed in a file by passing the link name, page data, output directory and the counter as follows −

write_output(link, page, output_dir, counter)
queue = find_links(website, page, queue)
logger.info("Identified {} links throughout website".format(
   len(queue)))

Now, when we run this script by providing the URL of the website, the output directory and a path to the log file, we will get the details about that web page that can be used for future use.

Virus Hunting

Have you ever wondered how forensic analysts, security researchers, and incident respondents can understand the difference between useful software and malware? The answer lies in the question itself, because without studying about the malware, rapidly generating by hackers, it is quite impossible for researchers and specialists to tell the difference between useful software and malware. In this section, let us discuss about VirusShare, a tool to accomplish this task.

Understanding VirusShare

VirusShare is the largest privately owned collection of malware samples to provide security researchers, incident responders, and forensic analysts the samples of live malicious code. It contains over 30 million samples.

The benefit of VirusShare is the list of malware hashes that is freely available. Anybody can use these hashes to create a very comprehensive hash set and use that to identify potentially malicious files. But before using VirusShare, we suggest you to visit https://virusshare.com for more details.

Creating Newline-Delimited Hash List from VirusShare using Python

A hash list from VirusShare can be used by various forensic tools such as X-ways and EnCase. In the script discussed below, we are going to automate downloading lists of hashes from VirusShare to create a newline-delimited hash list.

For this script, we need a third party Python library tqdm which can be downloaded as follows −

pip install tqdm

Note that in this script, first we will read the VirusShare hashes page and dynamically identify the most recent hash list. Then we will initialize the progress bar and download the hash list in the desired range.

First, import the following libraries −

from __future__ import print_function

import argparse
import os
import ssl
import sys
import tqdm

from urllib.request import urlopen
import urllib.error

This script will take one positional argument, which would be the desired path for the hash set −

if __name__ == '__main__':
   parser = argparse.ArgumentParser('Hash set from VirusShare')
   parser.add_argument("OUTPUT_HASH", help = "Output Hashset")
   parser.add_argument("--start", type = int, help = "Optional starting location")
   args = parser.parse_args()

Now, we will perform the standard input validation as follows −

directory = os.path.dirname(args.OUTPUT_HASH)
if not os.path.exists(directory):
   os.makedirs(directory)
if args.start:
   main(args.OUTPUT_HASH, start=args.start)
else:
   main(args.OUTPUT_HASH)

Now we need to define main() function with **kwargs as an argument because this will create a dictionary we can refer to support supplied key arguments as shown below −

def main(hashset, **kwargs):
   url = "https://virusshare.com/hashes.4n6"
   print("[+] Identifying hash set range from {}".format(url))
   context = ssl._create_unverified_context()

Now, we need to open VirusShare hashes page by using urlib.request.urlopen() method. We will use try-except block as follows −

try:
   index = urlopen(url, context = context).read().decode("utf-8")
except urllib.error.HTTPError as e:
   print("[-] Error accessing webpage - exiting..")
   sys.exit(1)

Now, identify latest hash list from downloaded pages. You can do this by finding the last instance of the HTML href tag to VirusShare hash list. It can be done with the following lines of code −

tag = index.rfind(r'a href = "hashes/VirusShare_')
stop = int(index[tag + 27: tag + 27 + 5].lstrip("0"))

if "start" not in kwa<rgs:
   start = 0
else:
   start = kwargs["start"]

if start < 0 or start > stop:
   print("[-] Supplied start argument must be greater than or equal ""to zero but less than the latest hash list, ""currently: {}".format(stop))
sys.exit(2)
print("[+] Creating a hashset from hash lists {} to {}".format(start, stop))
hashes_downloaded = 0

Now, we will use tqdm.trange() method to create a loop and progress bar as follows −

for x in tqdm.trange(start, stop + 1, unit_scale=True,desc="Progress"):
   url_hash = "https://virusshare.com/hashes/VirusShare_"\"{}.md5".format(str(x).zfill(5))
   try:
      hashes = urlopen(url_hash, context=context).read().decode("utf-8")
      hashes_list = hashes.split("\n")
   except urllib.error.HTTPError as e:
      print("[-] Error accessing webpage for hash list {}"" - continuing..".format(x))
   continue

After performing the above steps successfully, we will open the hash set text file in a+ mode to append to the bottom of text file.

with open(hashset, "a+") as hashfile:
   for line in hashes_list:
   if not line.startswith("#") and line != "":
      hashes_downloaded += 1
      hashfile.write(line + '\n')
   print("[+] Finished downloading {} hashes into {}".format(
      hashes_downloaded, hashset))

After running the above script, you will get the latest hash list containing MD5 hash values in text format.

Previous Page Print Page