Mirror of https://codeberg.org/ral/web_archive_cli.git (synced 2024-08-16 09:59:49 +02:00)
Web Archive CLI
Simple Python CLI to archive whole websites to the Web Archive via a sitemap.xml file.
Installation
Create a fresh Python virtual env:
python3 -m venv venv
and activate it:
. venv/bin/activate
and install the dependencies:
pip install -r requirements.txt
Usage
Activate the Python virtual env:
. venv/bin/activate
Convert a sitemap.xml file to a plain list of URLs:
python sitemap_to_urllist.py sitemap_example.org_2023-01-01.xml
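The conversion step can be pictured with the following minimal sketch. It is not the code of sitemap_to_urllist.py; it only assumes that the sitemap follows the standard sitemaps.org format, where each page is listed as a `<url><loc>` entry. The function name `sitemap_to_urls` and the `EXAMPLE` document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# XML namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_urls(xml_text):
    """Return the text of every <url><loc> entry, in document order."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        # Skip empty <loc> elements and trim surrounding whitespace.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Hypothetical two-page sitemap used only to demonstrate the function.
EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/about</loc></url>
</urlset>"""

urls = sitemap_to_urls(EXAMPLE)
```

Writing `"\n".join(urls)` to a text file would then give a plain URL list with one URL per line.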
Push all URLs to the web archive:
python do_archive.py urls_example.org_2023-01-01.txt
Note: The file names must strictly follow the scheme used in the examples above, with the site's URL and the date encoded in the name.
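To make the naming scheme concrete, here is a small sketch that maps a sitemap file name to the matching URL-list file name. The pattern is inferred only from the two example names above (`sitemap_<site>_<YYYY-MM-DD>.xml` and `urls_<site>_<YYYY-MM-DD>.txt`); the scripts themselves may check names differently, and `url_list_name` is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumed pattern, inferred from the example file names in this README.
SITEMAP_NAME = re.compile(r"^sitemap_(?P<site>.+)_(?P<date>\d{4}-\d{2}-\d{2})\.xml$")


def url_list_name(sitemap_path):
    """Map a sitemap file name to the corresponding URL-list file name."""
    match = SITEMAP_NAME.match(Path(sitemap_path).name)
    if match is None:
        # Reject names that do not carry both the site and the date.
        raise ValueError(f"unexpected sitemap file name: {sitemap_path!r}")
    return f"urls_{match['site']}_{match['date']}.txt"
```

A name such as `sitemap.xml`, which carries neither site nor date, raises `ValueError` in this sketch.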
Dependencies
The archive script is based on the savepagenow Python package.
To archive a single URL only, the savepagenow CLI can be used directly.
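For a whole URL list, the archiving step amounts to a loop over single captures. The sketch below is an assumption about how such a loop could look, not the code of do_archive.py. The capture function is passed in as a parameter so the loop can be tried without network access; `archive_urls` and `delay_seconds` are hypothetical names.

```python
import time


def archive_urls(urls, capture, delay_seconds=0.0):
    """Call capture(url) for each URL and return (archived, failed) lists.

    capture is any callable that takes a URL and returns the archived URL,
    raising an exception on failure.
    """
    archived, failed = [], []
    for url in urls:
        try:
            archived.append((url, capture(url)))
        except Exception as exc:
            # Record the error and continue, so one bad URL does not stop the run.
            failed.append((url, str(exc)))
        if delay_seconds:
            # Optional pause between requests to be gentle on the archive.
            time.sleep(delay_seconds)
    return archived, failed
```

With the savepagenow package installed, passing `savepagenow.capture` as the `capture` argument should submit each URL to the Web Archive; a suitable delay between requests is left to the user.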
Links
Wayback API:
Manual paper feed: