mirror of https://codeberg.org/ral/web_archive_cli.git synced 2024-08-16 09:59:49 +02:00

Web Archive CLI

Simple Python CLI to archive whole websites to the Web Archive via a sitemap.xml file.

Installation

Create a fresh Python virtual env:

python3 -m venv venv

and activate it:

. venv/bin/activate

and install the dependencies:

pip install -r requirements.txt

Usage

Activate the Python virtual env:

. venv/bin/activate

Convert a sitemap.xml file to a plain list of URLs:

python sitemap_to_urllist.py sitemap_example.org_2023-01-01.xml
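The conversion step can be pictured roughly as follows. This is a minimal sketch, not the code of sitemap_to_urllist.py: it assumes a standard sitemap that uses the sitemaps.org namespace, and the function name `sitemap_to_urls` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap.xml files (sitemaps.org protocol).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_to_urls(xml_text: str) -> list[str]:
    """Return the text of every <loc> element, in document order.

    Hypothetical helper for illustration only; the real script may differ.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Small inline sitemap so the sketch runs without any input file.
example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/about</loc></url>
</urlset>"""

# Print one URL per line, i.e. a plain URL list.
print("\n".join(sitemap_to_urls(example)))
```

Note that a sitemap index file also uses `<loc>` elements, but they point to further sitemaps rather than pages, so this sketch would list those sitemap URLs instead.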

Push all URLs to the web archive:

python do_archive.py urls_example.org_2023-01-01.txt
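A loop of the following shape is one way to push a URL list to the archive. It is a sketch under assumptions, not the code of do_archive.py: the names `archive_urls` and `fake_capture` are hypothetical, and the capture function is passed in so the example runs without network access. With the savepagenow package installed, its `capture` function (which, as far as I know, takes a URL and returns the URL of the archived copy) could be passed instead of the stand-in.

```python
import time
from typing import Callable, Iterable


def archive_urls(
    urls: Iterable[str],
    capture: Callable[[str], str],
    delay_seconds: float = 0.0,
) -> tuple[dict[str, str], dict[str, str]]:
    """Call capture(url) for each URL and collect successes and failures.

    Hypothetical helper: returns (archived, failed), where archived maps
    each URL to the value capture returned and failed maps each URL to
    the error message of the exception that was raised.
    """
    archived: dict[str, str] = {}
    failed: dict[str, str] = {}
    for url in urls:
        try:
            archived[url] = capture(url)
        except Exception as exc:  # keep going if a single URL fails
            failed[url] = str(exc)
        if delay_seconds:
            time.sleep(delay_seconds)  # optional pause between requests
    return archived, failed


def fake_capture(url: str) -> str:
    """Offline stand-in for a real capture call; the result URL is made up."""
    if "broken" in url:
        raise RuntimeError("capture failed")
    return f"https://web.archive.org/web/2023/{url}"


archived, failed = archive_urls(
    ["https://example.org/", "https://example.org/broken"],
    fake_capture,
)
print(archived)
print(failed)
```

Collecting failures instead of aborting means one unreachable page does not stop the rest of the list from being submitted.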

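Note: The file names must strictly follow the scheme shown above, with the site's domain and the date encoded in them (sitemap_example.org_2023-01-01.xml and urls_example.org_2023-01-01.txt).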
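The naming scheme can be checked before running the scripts. The sketch below assumes the pattern is `sitemap_<site>_<YYYY-MM-DD>.xml` for the input and `urls_<site>_<YYYY-MM-DD>.txt` for the URL list, as inferred from the two example names; the exact rules the scripts apply are not stated, and `url_list_name` is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumed scheme, inferred from the example file names in this readme.
SITEMAP_NAME = re.compile(
    r"^sitemap_(?P<site>.+)_(?P<date>\d{4}-\d{2}-\d{2})\.xml$"
)


def url_list_name(sitemap_path: str) -> str:
    """Derive the URL list file name that matches a sitemap file name."""
    match = SITEMAP_NAME.match(Path(sitemap_path).name)
    if match is None:
        raise ValueError(f"file name does not follow the scheme: {sitemap_path}")
    return f"urls_{match['site']}_{match['date']}.txt"


print(url_list_name("sitemap_example.org_2023-01-01.xml"))
```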

Dependencies

The archive script is based on the savepagenow Python package:

To archive a single URL only, the savepagenow CLI can be used directly:

Wayback API:

Manual paper feed: