Enhance docs

This commit is contained in:
Ali Parlakçı
2021-04-18 03:26:07 +03:00
committed by Ali Parlakci
parent e78ecd5626
commit 0d407d7a39
2 changed files with 76 additions and 31 deletions

@@ -1,39 +1,48 @@
# Bulk Downloader for Reddit v2 \[BETA\]
[![Python Test](https://github.com/aliparlakci/bulk-downloader-for-reddit/actions/workflows/test.yml/badge.svg?branch=v2)](https://github.com/aliparlakci/bulk-downloader-for-reddit/actions/workflows/test.yml)
This is a tool to download submissions or submission data from Reddit. It can be used to archive data or even crawl Reddit to gather research data. The BDFR is flexible and can be used in scripts if needed through an extensive command-line interface.
Some quick reference commands are:
```bash
python3 -m bdfr download --subreddit Python -L 10
python3 -m bdfr download --user me --saved --authenticate -L 25 --file-scheme '{POSTID}'
python3 -m bdfr download --subreddit 'Python, all, mindustry' -L 10 --make-hard-links
python3 -m bdfr archive --subreddit all --format yaml -L 500 --folder-scheme ''
```
This is a tool to download submissions or submission data from Reddit. It can be used to archive data or even crawl Reddit to gather research data. The BDFR is flexible and can be used in scripts if needed through an extensive command-line interface. [List of currently supported sources](#list-of-currently-supported-sources)
If you wish to open an issue, please read [the guide on opening issues](docs/CONTRIBUTING.md#opening-an-issue) to ensure that your issue is clear and contains everything it needs to for the developers to investigate.
## Installation
*Bulk Downloader for Reddit* **requires** Python 3.9.x and is distributed via `pip`. Install it with:
```bash
pip install bdfr
```
If you want to use the source code or make contributions, refer to [CONTRIBUTING](docs/CONTRIBUTING.md#preparing-the-environment-for-development).
## Usage
The BDFR works by taking submissions from a variety of "sources" from Reddit and then parsing them to download. These sources might be a subreddit, multireddit, a user list, or individual links. These sources are combined and downloaded to disk, according to a naming and organisational scheme defined by the user.
The BDFR has two modes: download and archive. Each has a command that performs a similar but distinct function. The `download` command downloads the resource linked in the Reddit submission, such as images or video. The `archive` command downloads and stores the submission data itself, such as the submission details, upvotes, text, statistics, and all the comments on that submission. This data can then be saved in a data markup format such as JSON, XML, or YAML.
Many websites and links are supported for the downloader:
After installation, run the program from any directory as shown below:
```bash
python3 -m bdfr download
```
```bash
python3 -m bdfr archive
```
- Direct links (links leading to a file)
- Erome
- Gfycat
- Gif Delivery Network
- Imgur
- Reddit Galleries
- Reddit Text Posts
- Reddit Videos
- Redgifs
- YouTube
On their own, however, these commands are not enough. Combine them with the parameters listed in [Options](#options) to suit your use case. Keep in mind that some parameters can be provided multiple times. Some quick reference commands are:
### Options
```bash
python3 -m bdfr download --subreddit Python -L 10
```
```bash
python3 -m bdfr download --user me --saved --authenticate -L 25 --file-scheme '{POSTID}'
```
```bash
python3 -m bdfr download --subreddit 'Python, all, mindustry' -L 10 --make-hard-links
```
```bash
python3 -m bdfr archive --subreddit all --format yaml -L 500 --folder-scheme ''
```
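Since the BDFR can be driven from scripts, the quick reference commands above can also be assembled programmatically. The following is a minimal sketch, not part of the BDFR itself: `build_bdfr_command` is a hypothetical helper, and it only uses flags that appear in the commands above. It builds the argument list without running anything.

```python
import shlex


def build_bdfr_command(mode, subreddits, limit, extra=()):
    """Assemble a BDFR command line as a list of arguments.

    Hypothetical helper for illustration only; it is not part of the BDFR.
    """
    if mode not in ("download", "archive"):
        raise ValueError(f"unknown mode: {mode!r}")
    return [
        "python3", "-m", "bdfr", mode,
        # Several subreddits are passed as one comma-separated value,
        # mirroring the third quick reference command.
        "--subreddit", ", ".join(subreddits),
        "-L", str(limit),
        *extra,
    ]


command = build_bdfr_command(
    "download", ["Python", "all", "mindustry"], 10, ["--make-hard-links"]
)
# shlex.join quotes arguments so the line can be pasted into a shell.
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command)` to execute it; keeping the arguments as a list avoids shell quoting problems.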
## Options
The following options are common between both the `archive` and `download` commands of the BDFR.
@@ -103,7 +112,7 @@ The following options are common between both the `archive` and `download` comma
- Increases the verbosity of the program
- Can be specified multiple times
#### Downloader Options
### Downloader Options
The following options apply only to the `download` command. This command downloads the files and resources linked to in the submission, or a text submission itself, to the disk in the specified directory.
@@ -145,7 +154,7 @@ The following options apply only to the `download` command. This command downloa
- Can be specified multiple times
- Also accepts CSV subreddit names
#### Archiver Options
### Archiver Options
The following options are for the `archive` command specifically.
@@ -198,8 +207,7 @@ It is highly recommended that the file name scheme contain the parameter `{POSTI
## Configuration
The configuration files are, by default, stored in the configuration directory for the user. This differs depending on the OS that the BDFR is being run on. For Windows, this will be:
- `C:\Documents and Settings\<User>\Application Data\Local Settings\BDFR\bdfr` or
- `C:\Documents and Settings\<User>\Application Data\BDFR\bdfr`
- `C:\Users\<User>\AppData\Local\BDFR\bdfr`
On macOS, this will be:
- `~/Library/Application Support/bdfr`.
@@ -223,7 +231,7 @@ All of these should not be modified unless you know what you're doing, as the de
Most of these values have to do with OAuth2 configuration and authorisation. The key `backup_log_count`, however, controls log rollover. The logs in the configuration directory can be verbose and, for long runs of the BDFR, can grow quite large. To combat this, the BDFR will overwrite previous logs. This value determines how many previous run logs will be kept. The default is 3, which means that the BDFR will keep at most three past logs plus the current one. Any run past this overwrites the oldest log file, a process called "rolling over". If you want more records of past runs, increase this number.
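The rollover behaviour described above can be illustrated with Python's standard `logging.handlers.RotatingFileHandler`. This is a sketch under assumptions: the BDFR's actual logging setup is not shown here and may differ, and both `start_run_log` and the file name `run.log` are hypothetical.

```python
import logging
import logging.handlers
import tempfile
from pathlib import Path


def start_run_log(log_path, backup_log_count=3):
    """Open a fresh log for one run, keeping `backup_log_count` older logs.

    Hypothetical helper for illustration; not the BDFR's own code.
    """
    handler = logging.handlers.RotatingFileHandler(
        log_path, backupCount=backup_log_count, delay=True
    )
    if Path(log_path).exists():
        # Shift run.log -> run.log.1 -> run.log.2 ...; the oldest is dropped.
        handler.doRollover()
    return handler


with tempfile.TemporaryDirectory() as directory:
    log_path = Path(directory) / "run.log"  # hypothetical file name
    for run in range(6):
        handler = start_run_log(log_path)
        handler.emit(logging.makeLogRecord({"msg": f"run {run}"}))
        handler.close()
    # Six runs, but only the current log and three backups remain.
    print(sorted(path.name for path in Path(directory).iterdir()))
```

With the default of 3, the directory never holds more than four log files, no matter how many runs take place.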
#### Rate Limiting
### Rate Limiting
The option `max_wait_time` has to do with retrying downloads. Certain HTTP errors mean that no number of retries will return the wanted data, but other errors come from rate-limiting. This is when a single client makes so many requests that the remote website cuts the client off to preserve the function of the site. This is a common situation when downloading many resources from the same site. It is polite and best practice to obey the website's wishes in these cases.
@@ -231,6 +239,19 @@ To this end, the BDFR will sleep for a time before retrying the download, giving
The option `--max-wait-time` and the configuration option `max_wait_time` both specify the maximum time the BDFR will wait. If both are present, the command-line option takes precedence. For instance, the default is 120, so the BDFR will wait for 60 seconds, then 120 seconds, and then move on. **Note that this results in a total time of 180 seconds trying the same download**. If you wish to try to bypass the rate-limiting system on the remote site, increasing the maximum wait time may help. However, note that the waits add up across retries, so the total time spent on one resource grows much faster than the maximum itself: specifying a max value of 300 (5 minutes) can make the BDFR pause for 15 minutes on one submission, not 5, in the worst case.
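The two figures quoted above (180 seconds for a maximum of 120, and 15 minutes for a maximum of 300) can be checked with a short calculation. The sketch below assumes that each wait is 60 seconds longer than the previous one until the maximum is reached. This schedule is an assumption chosen because it reproduces both figures; it is not taken from the BDFR source, and `wait_schedule` is a hypothetical name.

```python
def wait_schedule(max_wait_time, step=60):
    """Return the successive waits, in seconds, for one failing download.

    Assumed schedule: start at `step` and add `step` after every failed
    attempt until `max_wait_time` is reached.
    """
    waits = []
    current = step
    while True:
        waits.append(current)
        if current >= max_wait_time:
            break
        current += step
    return waits


for maximum in (120, 300):
    waits = wait_schedule(maximum)
    print(f"max={maximum}: waits={waits}, total={sum(waits)} seconds")
```

Under this assumption, a maximum of 120 gives waits of 60 and 120 seconds (180 in total), and a maximum of 300 gives five waits totalling 900 seconds, which is the 15 minutes mentioned above.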
## List of currently supported sources
- Direct links (links leading to a file)
- Erome
- Gfycat
- Gif Delivery Network
- Imgur
- Reddit Galleries
- Reddit Text Posts
- Reddit Videos
- Redgifs
- YouTube
## Contributing
If you wish to contribute, see [Contributing](docs/CONTRIBUTING.md) for more information.