51
README.md
@@ -27,16 +27,24 @@ If you want to use the source code or make contributions, refer to [CONTRIBUTING
The BDFR takes submissions from a variety of Reddit "sources" and parses them for download. A source might be a subreddit, a multireddit, a user list, or an individual link. The submissions from all sources are combined and downloaded to disk according to a naming and organisational scheme defined by the user.

There are two modes to the BDFR: download, and archive. Each one has a command that performs similar but distinct functions. The `download` command will download the resource linked in the Reddit submission, such as the images, video, etc. The `archive` command will download the submission data itself and store it, such as the submission details, upvotes, text, statistics, as well as all the comments on that submission. These can then be saved in a data markup language form, such as JSON, XML, or YAML.

There are three modes to the BDFR: download, archive, and clone. Each one has a command that performs similar but distinct functions. The `download` command will download the resource linked in the Reddit submission, such as images or video. The `archive` command will download the submission data itself and store it, such as the submission details, upvotes, text, and statistics, as well as all the comments on that submission. These can then be saved in a data serialisation format such as JSON, XML, or YAML. Lastly, the `clone` command will perform both functions of the previous commands at once and is more efficient than running those commands sequentially.

Note that the `clone` command is not a true, faithful clone of Reddit. It simply retrieves much of the raw data that Reddit provides. To get a true clone of Reddit, another tool such as HTTrack should be used.

After installation, run the program from any directory as shown below:

```bash
python3 -m bdfr download
```

```bash
python3 -m bdfr archive
```

```bash
python3 -m bdfr clone
```

However, these bare commands are not enough on their own. Chain the parameters listed in [Options](#options) according to your use case; note that some parameters can be provided multiple times. Some quick reference commands are:

```bash
@@ -64,6 +72,10 @@ The following options are common between both the `archive` and `download` comma
- `--config`
  - If the path to a configuration file is supplied with this option, the BDFR will use the specified config
  - See [Configuration Files](#configuration) for more details
- `--disable-module`
  - Can be specified multiple times
  - Disables certain modules from being used
  - See [Disabling Modules](#disabling-modules) for more information and a list of module names
- `--log`
  - This allows one to specify the location of the logfile
  - This must be done when running multiple instances of the BDFR; see [Multiple Instances](#multiple-instances) below
@@ -124,6 +136,8 @@ The following options are common between both the `archive` and `download` comma
- `-u, --user`
  - This specifies the user to scrape in concert with other options
  - When using `--authenticate`, `--user me` can be used to refer to the authenticated user
  - Can be specified multiple times for multiple users
  - If downloading a multireddit, only one user can be specified
- `-v, --verbose`
  - Increases the verbosity of the program
  - Can be specified multiple times
@@ -132,13 +146,6 @@ The following options are common between both the `archive` and `download` comma
The following options apply only to the `download` command. This command downloads the files and resources linked to in the submission, or a text submission itself, to the disk in the specified directory.

- `--exclude-id`
  - This will skip the download of any submission with the ID provided
  - Can be specified multiple times
- `--exclude-id-file`
  - This will skip the download of any submission with any of the IDs in the files provided
  - Can be specified multiple times
  - Format is one ID per line
- `--make-hard-links`
  - This flag will create hard links to an existing file when a duplicate is downloaded
  - This will make the file appear in multiple directories while only taking the space of a single instance
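
The space saving described for `--make-hard-links` comes from how hard links work at the filesystem level. The following is a minimal sketch using `os.link` from the Python standard library, not the BDFR's own code. The folder and file names are hypothetical, and the example assumes a filesystem that supports hard links.

```python
import os
import tempfile
from pathlib import Path


def link_duplicate(existing: Path, duplicate: Path) -> None:
    """Create `duplicate` as a hard link to `existing` instead of a second copy."""
    duplicate.parent.mkdir(parents=True, exist_ok=True)
    os.link(existing, duplicate)


with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: the same image appears in two subreddit folders.
    original = Path(root, "pics", "cat.jpg")
    original.parent.mkdir(parents=True)
    original.write_bytes(b"image data")

    duplicate = Path(root, "aww", "cat.jpg")
    link_duplicate(original, duplicate)

    # Both names refer to one file on disk, so the data is stored only once.
    same_file = os.path.samefile(original, duplicate)
    link_count = original.stat().st_nlink
    print(same_file, link_count)
```

Deleting one of the two names leaves the data reachable through the other, which is why the file can safely appear in several directories.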
@@ -159,6 +166,13 @@ The following options apply only to the `download` command. This command downloa
  - Sets the scheme for folders
  - Default is `{SUBREDDIT}`
  - See [Folder and File Name Schemes](#folder-and-file-name-schemes) for more details
- `--exclude-id`
  - This will skip the download of any submission with the ID provided
  - Can be specified multiple times
- `--exclude-id-file`
  - This will skip the download of any submission with any of the IDs in the files provided
  - Can be specified multiple times
  - Format is one ID per line
- `--skip-domain`
  - This adds domains to the download filter, i.e. submissions coming from these domains will not be downloaded
  - Can be specified multiple times
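
As an illustration of the file format expected by `--exclude-id-file`, the sketch below merges several one-ID-per-line files into a single set. This is an assumption about how such files could be read, not the BDFR's actual implementation. The function name `read_excluded_ids` and the sample IDs are hypothetical, and ignoring blank lines is a choice made here rather than documented behaviour.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Set


def read_excluded_ids(paths: Iterable[Path]) -> Set[str]:
    """Collect submission IDs from files that hold one ID per line."""
    excluded: Set[str] = set()
    for path in paths:
        for line in path.read_text().splitlines():
            submission_id = line.strip()
            if submission_id:  # assumption: blank lines are ignored
                excluded.add(submission_id)
    return excluded


with tempfile.TemporaryDirectory() as root:
    # Two hypothetical ID files, as if the option were given twice.
    first = Path(root, "seen.txt")
    second = Path(root, "unwanted.txt")
    first.write_text("abc123\ndef456\n")
    second.write_text("def456\n\nghi789\n")

    excluded_ids = read_excluded_ids([first, second])
    print(sorted(excluded_ids))
```

Using a set means an ID listed in more than one file is only stored once.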
@@ -183,6 +197,10 @@ The following options are for the `archive` command specifically.
- `xml`
- `yaml`

### Cloner Options

The `clone` command can take all the options listed above for both the `archive` and `download` commands since it performs the functions of both.

## Authentication and Security

The BDFR uses OAuth2 authentication to connect to Reddit if authentication is required. This means that it is a secure, token-based system for making requests. This also means that the BDFR only has access to specific parts of the authenticated account: by default, only saved posts, upvoted posts, and the identity of the authenticated account. Note that authentication is not required unless accessing private data such as upvoted posts, saved posts, and private multireddits.
@@ -253,6 +271,7 @@ The following keys are optional, and defaults will be used if they cannot be fou
- `backup_log_count`
- `max_wait_time`
- `time_format`
- `disabled_modules`

None of these should be modified unless you know what you're doing, as the default values will enable the BDFR to function just fine. A configuration is included with the BDFR when it is installed, and this will be placed in the configuration directory as the default.

@@ -264,6 +283,22 @@ The option `time_format` will specify the format of the timestamp that replaces
The format can be specified through the [format codes](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior) that are standard in the Python `datetime` library.
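
To illustrate how these format codes behave, here is a short sketch using `datetime.strftime` from the standard library. The timestamp and the format strings are arbitrary examples chosen for illustration, not BDFR defaults.

```python
from datetime import datetime

# An arbitrary example timestamp, standing in for a submission's creation time.
created = datetime(2021, 5, 17, 13, 45, 0)

date_only = created.strftime("%Y-%m-%d")
date_and_time = created.strftime("%Y%m%d_%H%M%S")

print(date_only)      # 2021-05-17
print(date_and_time)  # 20210517_134500
```

A format without characters such as `:` or `/` is usually a safer choice when the timestamp ends up in a file name.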

#### Disabling Modules

The individual modules of the BDFR, used to download submissions from websites, can be disabled. This is especially helpful for the fallback downloaders, since the `--skip-domain` option cannot be used effectively in these cases. For example, the Youtube-DL downloader can retrieve data from hundreds of websites and domains; thus the only way to fully disable it is via the `--disable-module` option.

Modules can be disabled through the command line interface for the BDFR or, more permanently, in the configuration file via the `disabled_modules` option. The downloaders that can be disabled are listed below. Note that the names are case-insensitive.

- `Direct`
- `Erome`
- `Gallery` (Reddit Image Galleries)
- `Gfycat`
- `Imgur`
- `Redgifs`
- `SelfPost` (Reddit Text Post)
- `Youtube`
- `YoutubeDlFallback`
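
Because the module names are case-insensitive, a check against the disabled list can normalise both sides before comparing. The following is a minimal sketch of that idea, not the BDFR's own logic; the function `is_disabled` is hypothetical.

```python
from typing import Iterable


def is_disabled(module_name: str, disabled_modules: Iterable[str]) -> bool:
    """Return True if `module_name` appears in `disabled_modules`, ignoring case."""
    normalised = {name.strip().lower() for name in disabled_modules}
    return module_name.lower() in normalised


# Names as a user might type them on the command line or in the config file.
disabled = ["youtubedlfallback", "IMGUR"]

print(is_disabled("YoutubeDlFallback", disabled))  # True
print(is_disabled("Imgur", disabled))              # True
print(is_disabled("Redgifs", disabled))            # False
```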

### Rate Limiting

The option `max_wait_time` relates to retrying downloads. Certain HTTP errors mean that no number of requests will return the wanted data, but other errors are caused by rate limiting. This happens when a single client makes so many requests that the remote website cuts the client off to preserve the function of the site. It is a common situation when downloading many resources from the same site. It is polite and best practice to obey the website's wishes in these cases.
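
One common way to respect rate limits is to wait for progressively longer periods between retries, up to a ceiling. The sketch below assumes that `max_wait_time` acts as such a ceiling in seconds. The 60-second step, the `RateLimited` exception, and the `retry_with_waits` helper are all hypothetical and are not taken from the BDFR's code.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised when the remote site asks the client to slow down."""


def retry_with_waits(
    fetch: Callable[[], T],
    max_wait_time: int,
    step: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch`, waiting step, 2*step, ... seconds between rate-limited attempts.

    Gives up and re-raises once the next wait would exceed `max_wait_time`.
    """
    wait = step
    while True:
        try:
            return fetch()
        except RateLimited:
            if wait > max_wait_time:
                raise
            sleep(wait)
            wait += step
```

Only the rate-limiting error is retried here; any other exception propagates immediately, which matches the distinction drawn above between errors worth retrying and errors that will never succeed. Passing a stand-in for `sleep` makes the helper easy to exercise without real delays.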