Web Archive CLI
===============

Simple Python CLI to archive whole websites to the Web Archive via a `sitemap.xml` file.

Installation
------------

Create a fresh Python virtual env:

```
python3 -m venv venv
```

and activate it:

```
. venv/bin/activate
```

and install the dependencies:

```
pip install -r requirements.txt
```

Usage
-----

Activate the Python virtual env:

```
. venv/bin/activate
```

Convert a `sitemap.xml` file to a plain list of URLs:

```
python sitemap_to_urllist.py sitemap_example.org_2023-01-01.xml
```

Push all URLs to the Web Archive:

```
python do_archive.py urls_example.org_2023-01-01.txt
```

Note: Strictly follow the naming scheme used in the examples above, with the site and the date encoded in the file names.

Dependencies
------------

The archive script is based on the `savepagenow` Python package:

* https://pypi.org/project/savepagenow/
* https://github.com/palewire/savepagenow

To archive a single URL only, the `savepagenow` CLI can be used directly:

* https://palewi.re/docs/savepagenow/cli.html

Links
-----

Wayback API:

* https://archive.org/help/wayback_api.php

Manual paper feed:

* https://web.archive.org/save/
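
For readers curious about what the sitemap conversion step in the Usage section involves, here is a minimal sketch using only the Python standard library. It is not the code of `sitemap_to_urllist.py`; the function name `extract_urls` is hypothetical, and the sketch assumes the sitemap uses the standard sitemaps.org namespace and is a plain URL set (sitemap index files are not handled).

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol (assumed for the input file).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return the text of every <url><loc> entry in a sitemap document.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        # Skip empty <loc> elements and trim surrounding whitespace.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Small inline sample so the sketch can be run without a file.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/about</loc></url>
</urlset>"""

for url in extract_urls(sample):
    print(url)
```

Writing the printed lines to a text file yields the kind of plain URL list that the archive step takes as input.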