Mirror of https://codeberg.org/ral/web_archive_cli.git, synced 2024-08-16 09:59:49 +02:00

Web Archive CLI
===============

Simple Python CLI to archive whole websites to the Web Archive via a `sitemap.xml` file.


Installation
------------

Create a fresh Python virtual env:

```
python3 -m venv venv
```

and activate it:

```
. venv/bin/activate
```

and install the dependencies:

```
pip install -r requirements.txt
```


Usage
-----

Activate the Python virtual env:

```
. venv/bin/activate
```

Convert a `sitemap.xml` file to a plain list of URLs:

```
python sitemap_to_urllist.py sitemap_example.org_2023-01-01.xml
```
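
For readers who want to see what such a conversion involves, here is a minimal sketch that extracts the `<loc>` entries from a sitemap using only the standard library. It is not the code of `sitemap_to_urllist.py`; the function name `sitemap_to_urls` is hypothetical, and the sketch assumes a plain `<urlset>` sitemap in the sitemaps.org namespace (no sitemap index files).

```python
import xml.etree.ElementTree as ET

# XML namespace used by sitemaps that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_urls(xml_text: str) -> list[str]:
    """Return the <loc> value of every <url> entry in a sitemap document.

    Hypothetical helper for illustration; assumes a <urlset> root element.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        # Skip empty <loc> elements and trim surrounding whitespace.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


if __name__ == "__main__":
    example = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.org/</loc></url>"
        "<url><loc>https://example.org/about/</loc></url>"
        "</urlset>"
    )
    # One URL per line, like a plain URL list file.
    print("\n".join(sitemap_to_urls(example)))
```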

Push all URLs to the web archive:

```
python do_archive.py urls_example.org_2023-01-01.txt
```
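
As a rough illustration of this step, the sketch below loops over a URL list and hands each URL to a capture function. It is not the code of `do_archive.py`: the names `archive_urls` and `main` and the 5-second pause are assumptions made for this example. The only `savepagenow` call used is `savepagenow.capture(url)`, which returns the URL of the archived snapshot.

```python
import time
from typing import Callable, Iterable


def archive_urls(
    urls: Iterable[str],
    capture: Callable[[str], str],
    delay_seconds: float = 0.0,
) -> dict[str, str]:
    """Call `capture` for every URL and collect the outcome per URL.

    Hypothetical helper: maps each URL to the archive URL returned by
    `capture`, or to an error message if that capture failed.
    """
    results = {}
    for url in urls:
        try:
            results[url] = capture(url)
        except Exception as exc:  # one failed capture should not stop the run
            results[url] = f"ERROR: {exc}"
        time.sleep(delay_seconds)  # optional pause between requests
    return results


def main(url_file: str) -> None:
    """Hypothetical entry point: archive every URL listed in `url_file`."""
    import savepagenow  # third-party package, see requirements.txt

    with open(url_file, encoding="utf-8") as handle:
        url_list = [line.strip() for line in handle if line.strip()]

    # The 5-second delay is an arbitrary, assumed value.
    outcomes = archive_urls(url_list, savepagenow.capture, delay_seconds=5.0)
    for page, outcome in outcomes.items():
        print(page, "->", outcome)
```

Passing the capture function as a parameter keeps the loop testable without network access. Calling `main("urls_example.org_2023-01-01.txt")` would run it against the real service.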

Note: Strictly follow the naming scheme shown above, with the site (here `example.org`) and the date (here `2023-01-01`) encoded into the file names.
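
To make the scheme concrete, the following sketch validates a file name and derives the matching URL list name. The pattern `<prefix>_<site>_<YYYY-MM-DD>.<ext>` is inferred from the two example file names in this README only; the scripts themselves may parse names differently. `NAME_PATTERN`, `parse_name` and `url_list_name` are hypothetical names.

```python
import re
from datetime import date

# Assumed scheme, inferred from the README examples:
#   sitemap_<site>_<YYYY-MM-DD>.xml  and  urls_<site>_<YYYY-MM-DD>.txt
NAME_PATTERN = re.compile(
    r"^(?P<prefix>sitemap|urls)_(?P<site>.+)_(?P<date>\d{4}-\d{2}-\d{2})\.(?:xml|txt)$"
)


def parse_name(file_name: str) -> tuple[str, str, date]:
    """Split a file name into (prefix, site, date) or raise ValueError."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"file name does not follow the scheme: {file_name!r}")
    # date.fromisoformat also raises ValueError for impossible dates.
    return match["prefix"], match["site"], date.fromisoformat(match["date"])


def url_list_name(sitemap_name: str) -> str:
    """Derive the URL list file name that belongs to a sitemap file name."""
    prefix, site, day = parse_name(sitemap_name)
    if prefix != "sitemap":
        raise ValueError(f"not a sitemap file name: {sitemap_name!r}")
    return f"urls_{site}_{day.isoformat()}.txt"


if __name__ == "__main__":
    print(url_list_name("sitemap_example.org_2023-01-01.xml"))
```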


Dependencies
------------

The archive script is based on the `savepagenow` Python package:

* https://pypi.org/project/savepagenow/
* https://github.com/palewire/savepagenow


To archive a single URL only, the `savepagenow` CLI can be used directly:

* https://palewi.re/docs/savepagenow/cli.html


Links
-----

Wayback API:

* https://archive.org/help/wayback_api.php

Manual paper feed:

* https://web.archive.org/save/