Web Archive CLI
===============

A simple Python CLI that archives whole websites in the Web Archive (the Internet Archive's Wayback Machine), using a `sitemap.xml` file as the list of pages to submit.


Installation
------------

Create a fresh Python virtual env:

```
python3 -m venv venv
```

and activate it:

```
. venv/bin/activate
```

and install the dependencies:

```
pip install -r requirements.txt
```


Usage
-----

Activate the Python virtual env:

```
. venv/bin/activate
```

Convert a `sitemap.xml` file to a plain list of URLs:

```
python sitemap_to_urllist.py sitemap_example.org_2023-01-01.xml
```
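
For orientation, the conversion step can be sketched with the standard library alone. This is a minimal illustration and not the code of `sitemap_to_urllist.py`: the function name `sitemap_to_urls` is hypothetical, and the sketch assumes the sitemap uses the standard `sitemaps.org` namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (assumed here).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_to_urls(xml_text):
    """Return the text of every <loc> element, in document order.

    Hypothetical helper for illustration; the real script may differ.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        # Skip empty <loc> elements and trim surrounding whitespace.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/</loc></url>
  <url><loc>https://example.org/about</loc></url>
</urlset>"""

# One URL per line, as a plain URL list.
print("\n".join(sitemap_to_urls(EXAMPLE)))
```

Note that a sitemap index file also uses `<loc>` elements, so this sketch would list the nested sitemap files rather than page URLs in that case.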

Submit all URLs to the Web Archive:

```
python do_archive.py urls_example.org_2023-01-01.txt
```
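
The archiving step can be pictured roughly as a loop over the URL list. The sketch below is an assumption about the general approach, not the code of `do_archive.py`: the names `archive_urls`, `default_capture`, and the `delay` parameter are hypothetical. It relies on `savepagenow.capture_or_cache`, which returns the archive URL together with a flag telling whether a new capture was made.

```python
import time


def default_capture(url):
    """Capture one URL with savepagenow and return the archive URL."""
    # Imported lazily so the loop below can be tried without the package.
    import savepagenow

    archive_url, _newly_captured = savepagenow.capture_or_cache(url)
    return archive_url


def archive_urls(urls, capture=default_capture, delay=0.0):
    """Submit each URL and map it to its archive URL, or None on failure.

    Hypothetical helper: `capture` and `delay` are illustration-only names.
    """
    results = {}
    for url in urls:
        try:
            results[url] = capture(url)
        except Exception as exc:  # deliberately broad: one failure should not stop the run
            results[url] = None
            print(f"failed: {url} ({exc})")
        # Pause between requests; a suitable value is an assumption to tune.
        time.sleep(delay)
    return results
```

Passing a different `capture` callable makes it possible to dry-run the loop without contacting the Web Archive, for example `archive_urls(urls, capture=lambda u: u)`.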

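Note: The file names must strictly follow the scheme used in the examples above, with the site's domain and the date encoded in the name: `sitemap_<domain>_<date>.xml` for the sitemap and `urls_<domain>_<date>.txt` for the URL list.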
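
As an illustration of the naming scheme, the following sketch derives the URL list name from a sitemap file name. The pattern is inferred from the two example file names only, including the `YYYY-MM-DD` date format; the helper `url_list_name` is hypothetical and the scripts may validate names differently.

```python
import re

# Pattern inferred from "sitemap_example.org_2023-01-01.xml" (an assumption).
SITEMAP_NAME = re.compile(
    r"^sitemap_(?P<domain>.+)_(?P<date>\d{4}-\d{2}-\d{2})\.xml$"
)


def url_list_name(sitemap_name):
    """Map a sitemap file name to the matching URL list file name."""
    match = SITEMAP_NAME.match(sitemap_name)
    if match is None:
        raise ValueError(f"file name does not follow the scheme: {sitemap_name}")
    return f"urls_{match['domain']}_{match['date']}.txt"


print(url_list_name("sitemap_example.org_2023-01-01.xml"))
# prints: urls_example.org_2023-01-01.txt
```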


Dependencies
------------

The archive script is based on the `savepagenow` Python package:

* https://pypi.org/project/savepagenow/
* https://github.com/palewire/savepagenow


To archive a single URL only, the `savepagenow` CLI can be used directly:

* https://palewi.re/docs/savepagenow/cli.html


Links
-----

Wayback API:

* https://archive.org/help/wayback_api.php

Manual submission via the web form:

* https://web.archive.org/save/