mirror of https://codeberg.org/ral/web_archive_cli.git synced 2024-08-16 09:59:49 +02:00

Web Archive CLI
===============
Simple Python CLI to archive whole websites to the Web Archive via a `sitemap.xml` file.

Installation
------------
Create a fresh Python virtual env:
```
python3 -m venv venv
```
and activate it:
```
. venv/bin/activate
```
and install the dependencies:
```
pip install -r requirements.txt
```
Usage
-----
Activate the Python virtual env:
```
. venv/bin/activate
```
Convert a `sitemap.xml` file to a plain list of URLs:
```
python sitemap_to_urllist.py sitemap_example.org_2023-01-01.xml
```
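
The conversion step can be pictured roughly as follows. This is a minimal sketch, not the code of `sitemap_to_urllist.py`: it assumes the sitemap uses the standard sitemaps.org namespace and that only `<url><loc>` entries matter. The function name `sitemap_to_urls` and the inline example document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_urls(xml_text):
    """Return the text of every <url><loc> element, in document order.

    Hypothetical helper for illustration; the real script may differ.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        # Skip empty <loc> elements and trim surrounding whitespace.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Tiny inline sitemap, used only to demonstrate the helper.
example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/about</loc></url>
</urlset>"""

# One URL per line, i.e. a plain URL list.
print("\n".join(sitemap_to_urls(example)))
```

A sitemap index file (`<sitemapindex>`) is not handled by this sketch; it would yield an empty list.
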
Submit every URL in the list to the Web Archive:
```
python do_archive.py urls_example.org_2023-01-01.txt
```
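
For orientation, the archiving step might look something like the loop below. It is a sketch under stated assumptions, not the contents of `do_archive.py`: the names `read_url_list` and `archive_urls`, the `delay` parameter, and the choice to continue after a failed capture are all hypothetical. The only library call used is `savepagenow.capture(url)`, which returns the URL of the archived snapshot.

```python
import time


def read_url_list(text):
    """Split a plain URL list into URLs, ignoring blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def archive_urls(urls, capture=None, delay=0.0):
    """Submit each URL and return a list of (url, archive_url, error) tuples.

    `capture` is any callable that takes a URL and returns the archive URL.
    When omitted, savepagenow.capture is imported and used.
    `delay` (seconds between requests) is an assumption of this sketch.
    """
    if capture is None:
        import savepagenow  # third-party: pip install savepagenow
        capture = savepagenow.capture
    results = []
    for index, url in enumerate(urls):
        if index and delay:
            time.sleep(delay)  # pause between requests, not before the first
        try:
            results.append((url, capture(url), None))
        except Exception as exc:
            # Record the failure and keep going with the remaining URLs.
            results.append((url, None, exc))
    return results
```

Passing a stand-in for `capture` makes a dry run possible without contacting the Web Archive, for example `archive_urls(urls, capture=lambda u: "dry-run:" + u)`.
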
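Note: The file names must strictly follow the scheme shown above, with the site's URL and the date encoded in them, as in `sitemap_example.org_2023-01-01.xml` and `urls_example.org_2023-01-01.txt`.
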
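
To make the naming scheme concrete, here is a small sketch that parses such file names. The pattern `<prefix>_<site>_<YYYY-MM-DD>.<ext>` is inferred from the two example names only; the regular expression and the helpers `parse_name` and `urllist_name_for` are hypothetical and not taken from the scripts.

```python
import re

# Assumed pattern, inferred from the example names:
#   sitemap_<site>_<YYYY-MM-DD>.xml  and  urls_<site>_<YYYY-MM-DD>.txt
NAME_RE = re.compile(
    r"^(?P<kind>sitemap|urls)_(?P<site>.+)_(?P<date>\d{4}-\d{2}-\d{2})\.(?P<ext>xml|txt)$"
)


def parse_name(file_name):
    """Return (site, date) if the name follows the scheme, else None."""
    match = NAME_RE.match(file_name)
    if match is None:
        return None
    return match.group("site"), match.group("date")


def urllist_name_for(sitemap_name):
    """Derive the URL list name that corresponds to a sitemap name."""
    parsed = parse_name(sitemap_name)
    if parsed is None:
        raise ValueError("file name does not follow the scheme: " + sitemap_name)
    site, date = parsed
    return "urls_{}_{}.txt".format(site, date)


print(urllist_name_for("sitemap_example.org_2023-01-01.xml"))
```
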
Dependencies
------------
The archive script is based on the `savepagenow` Python package:
* https://pypi.org/project/savepagenow/
* https://github.com/palewire/savepagenow
To archive a single URL only, the `savepagenow` CLI can be used directly:
* https://palewi.re/docs/savepagenow/cli.html

Links
-----
Wayback API:
* https://archive.org/help/wayback_api.php
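
As an illustration of the Wayback availability API documented at the link above, the following sketch builds a query and reads the closest snapshot from the JSON response. It is not part of this project's scripts; the helper names are hypothetical, and the response shape (`archived_snapshots` → `closest` → `available`, `url`) follows the API documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the query URL; `timestamp` is optional (YYYYMMDDhhmmss prefix)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response):
    """Return the snapshot URL from a decoded JSON response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check(url):
    """Query the API over the network and return the snapshot URL or None."""
    with urlopen(availability_query(url), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

Calling `check("example.org")` performs a real HTTP request, so the two pure helpers are kept separate to allow testing without network access.
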
Manual paper feed:
* https://web.archive.org/save/