Merge branch 'aliparlakci:master' into master
2
.gitattributes
vendored
Normal file
@@ -0,0 +1,2 @@
# Declare files that will always have CRLF line endings on checkout.
*.ps1 text eol=crlf
2
.github/ISSUE_TEMPLATE/bug_report.md
vendored
@@ -9,7 +9,7 @@ assignees: ''

- [ ] I am reporting a bug.
- [ ] I am running the latest version of BDfR
- [ ] I have read the [Opening an issue](../../README.md#configuration)
- [ ] I have read the [Opening an issue](https://github.com/aliparlakci/bulk-downloader-for-reddit/blob/master/docs/CONTRIBUTING.md#opening-an-issue)

## Description

A clear and concise description of what the bug is.

6
.github/workflows/publish.yml
vendored
@@ -27,3 +27,9 @@ jobs:
run: |
python setup.py sdist bdist_wheel
twine upload dist/*

- name: Upload coverage report
uses: actions/upload-artifact@v2
with:
name: dist
path: dist/

3
.gitignore
vendored
@@ -139,3 +139,6 @@ cython_debug/

# Test configuration file
test_config.cfg

.vscode/
.idea/
9
.gitmodules
vendored
Normal file
@@ -0,0 +1,9 @@
[submodule "scripts/tests/bats"]
path = scripts/tests/bats
url = https://github.com/bats-core/bats-core.git
[submodule "scripts/tests/test_helper/bats-assert"]
path = scripts/tests/test_helper/bats-assert
url = https://github.com/bats-core/bats-assert.git
[submodule "scripts/tests/test_helper/bats-support"]
path = scripts/tests/test_helper/bats-support
url = https://github.com/bats-core/bats-support.git
84
README.md
@@ -1,11 +1,14 @@
|
||||
# Bulk Downloader for Reddit
|
||||
[](https://github.com/aliparlakci/bulk-downloader-for-reddit/actions/workflows/test.yml)
|
||||
[](https://badge.fury.io/py/bdfr)
|
||||
[](https://pypi.python.org/pypi/bdfr)
|
||||
[](https://pypi.python.org/pypi/bdfr)
|
||||
[](https://github.com/aliparlakci/bulk-downloader-for-reddit/actions/workflows/test.yml)
|
||||
|
||||
This is a tool to download submissions or submission data from Reddit. It can be used to archive data or even crawl Reddit to gather research data. The BDFR is flexible and can be used in scripts if needed through an extensive command-line interface. [List of currently supported sources](#list-of-currently-supported-sources)
|
||||
|
||||
If you wish to open an issue, please read [the guide on opening issues](docs/CONTRIBUTING.md#opening-an-issue) to ensure that your issue is clear and contains everything it needs to for the developers to investigate.
|
||||
|
||||
Included in this README are a few example Bash tricks to get certain behaviour. For that, see [Common Command Tricks](#common-command-tricks).
|
||||
|
||||
## Installation
|
||||
*Bulk Downloader for Reddit* needs Python version 3.9 or above. Please update Python before installation to meet the requirement. Then, you can install it as such:
|
||||
```bash
|
||||
@@ -26,16 +29,24 @@ If you want to use the source code or make contributions, refer to [CONTRIBUTING
|
||||
|
||||
The BDFR works by taking submissions from a variety of "sources" from Reddit and then parsing them to download. These sources might be a subreddit, multireddit, a user list, or individual links. These sources are combined and downloaded to disk, according to a naming and organisational scheme defined by the user.
|
||||
|
||||
There are two modes to the BDFR: download, and archive. Each one has a command that performs similar but distinct functions. The `download` command will download the resource linked in the Reddit submission, such as the images, video, etc. The `archive` command will download the submission data itself and store it, such as the submission details, upvotes, text, statistics, as and all the comments on that submission. These can then be saved in a data markup language form, such as JSON, XML, or YAML.
|
||||
There are three modes to the BDFR: download, archive, and clone. Each one has a command that performs similar but distinct functions. The `download` command will download the resource linked in the Reddit submission, such as the images, video, etc. The `archive` command will download the submission data itself and store it, such as the submission details, upvotes, text, statistics, as and all the comments on that submission. These can then be saved in a data markup language form, such as JSON, XML, or YAML. Lastly, the `clone` command will perform both functions of the previous commands at once and is more efficient than running those commands sequentially.
|
||||
|
||||
Note that the `clone` command is not a true, faithful clone of Reddit. It simply retrieves much of the raw data that Reddit provides. To get a true clone of Reddit, another tool such as HTTrack should be used.
|
||||
|
||||
After installation, run the program from any directory as shown below:
|
||||
|
||||
```bash
|
||||
python3 -m bdfr download
|
||||
```
|
||||
|
||||
```bash
|
||||
python3 -m bdfr archive
|
||||
```
|
||||
|
||||
```bash
|
||||
python3 -m bdfr clone
|
||||
```
|
||||
|
||||
However, these commands are not enough. You should chain parameters in [Options](#options) according to your use case. Don't forget that some parameters can be provided multiple times. Some quick reference commands are:
|
||||
|
||||
```bash
|
||||
@@ -63,6 +74,17 @@ The following options are common between both the `archive` and `download` comma
|
||||
- `--config`
|
||||
- If the path to a configuration file is supplied with this option, the BDFR will use the specified config
|
||||
- See [Configuration Files](#configuration) for more details
|
||||
- `--disable-module`
|
||||
- Can be specified multiple times
|
||||
- Disables certain modules from being used
|
||||
- See [Disabling Modules](#disabling-modules) for more information and a list of module names
|
||||
- `--ignore-user`
|
||||
- This will add a user to ignore
|
||||
- Can be specified multiple times
|
||||
- `--include-id-file`
|
||||
- This will add any submission with the IDs in the files provided
|
||||
- Can be specified multiple times
|
||||
- Format is one ID per line
|
||||
- `--log`
|
||||
- This allows one to specify the location of the logfile
|
||||
- This must be done when running multiple instances of the BDFR, see [Multiple Instances](#multiple-instances) below
|
||||
@@ -123,6 +145,8 @@ The following options are common between both the `archive` and `download` comma
|
||||
- `-u, --user`
|
||||
- This specifies the user to scrape in concert with other options
|
||||
- When using `--authenticate`, `--user me` can be used to refer to the authenticated user
|
||||
- Can be specified multiple times for multiple users
|
||||
- If downloading a multireddit, only one user can be specified
|
||||
- `-v, --verbose`
|
||||
- Increases the verbosity of the program
|
||||
- Can be specified multiple times
|
||||
@@ -131,13 +155,6 @@ The following options are common between both the `archive` and `download` comma
|
||||
|
||||
The following options apply only to the `download` command. This command downloads the files and resources linked to in the submission, or a text submission itself, to the disk in the specified directory.
|
||||
|
||||
- `--exclude-id`
|
||||
- This will skip the download of any submission with the ID provided
|
||||
- Can be specified multiple times
|
||||
- `--exclude-id-file`
|
||||
- This will skip the download of any submission with any of the IDs in the files provided
|
||||
- Can be specified multiple times
|
||||
- Format is one ID per line
|
||||
- `--make-hard-links`
|
||||
- This flag will create hard links to an existing file when a duplicate is downloaded
|
||||
- This will make the file appear in multiple directories while only taking the space of a single instance
|
||||
@@ -158,6 +175,13 @@ The following options apply only to the `download` command. This command downloa
|
||||
- Sets the scheme for folders
|
||||
- Default is `{SUBREDDIT}`
|
||||
- See [Folder and File Name Schemes](#folder-and-file-name-schemes) for more details
|
||||
- `--exclude-id`
|
||||
- This will skip the download of any submission with the ID provided
|
||||
- Can be specified multiple times
|
||||
- `--exclude-id-file`
|
||||
- This will skip the download of any submission with any of the IDs in the files provided
|
||||
- Can be specified multiple times
|
||||
- Format is one ID per line
|
||||
- `--skip-domain`
|
||||
- This adds domains to the download filter i.e. submissions coming from these domains will not be downloaded
|
||||
- Can be specified multiple times
|
||||
@@ -181,6 +205,23 @@ The following options are for the `archive` command specifically.
|
||||
- `json` (default)
|
||||
- `xml`
|
||||
- `yaml`
|
||||
- `--comment-context`
|
||||
- This option will, instead of downloading an individual comment, download the submission that comment is a part of
|
||||
- May result in a longer run time as it retrieves much more data
|
||||
|
||||
### Cloner Options
|
||||
|
||||
The `clone` command can take all the options listed above for both the `archive` and `download` commands since it performs the functions of both.
|
||||
|
||||
## Common Command Tricks
|
||||
|
||||
A common use case is for subreddits/users to be loaded from a file. The BDFR doesn't support this directly but it is simple enough to do through the command-line. Consider a list of usernames to download; they can be passed through to the BDFR with the following command, assuming that the usernames are in a text file:
|
||||
|
||||
```bash
|
||||
cat users.txt | xargs -L 1 echo --user | xargs -L 50 python3 -m bdfr download <ARGS>
|
||||
```
|
||||
|
||||
The part `-L 50` is to make sure that the character limit for a single line isn't exceeded, but may not be necessary. This can also be used to load subreddits from a file, simply exchange `--user` with `--subreddit` and so on.
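For instance, a minimal sketch of the subreddit variant, assuming the subreddit names sit one per line in a hypothetical `subreddits.txt`:

```bash
cat subreddits.txt | xargs -L 1 echo --subreddit | xargs -L 50 python3 -m bdfr download <ARGS>
```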
|
||||
|
||||
## Authentication and Security
|
||||
|
||||
@@ -252,6 +293,7 @@ The following keys are optional, and defaults will be used if they cannot be fou
|
||||
- `backup_log_count`
|
||||
- `max_wait_time`
|
||||
- `time_format`
|
||||
- `disabled_modules`
|
||||
|
||||
All of these should not be modified unless you know what you're doing, as the default values will enable the BDFR to function just fine. A configuration is included in the BDFR when it is installed, and this will be placed in the configuration directory as the default.
|
||||
|
||||
@@ -263,6 +305,22 @@ The option `time_format` will specify the format of the timestamp that replaces
|
||||
|
||||
The format can be specified through the [format codes](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior) that are standard in the Python `datetime` library.
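As a hedged sketch (the directory, subreddit, and scheme below are illustrative, and `{DATE}` is assumed to be the key this timestamp replaces), the format can also be set per run:

```bash
python3 -m bdfr download ./reddit --subreddit Python --file-scheme "{DATE}_{TITLE}_{POSTID}" --time-format "%Y-%m-%dT%H-%M-%S"
```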
|
||||
|
||||
#### Disabling Modules
|
||||
|
||||
The individual modules of the BDFR, used to download submissions from websites, can be disabled. This is especially helpful in the case of the fallback downloaders, since the `--skip-domain` option cannot be used effectively there. For example, the Youtube-DL downloader can retrieve data from hundreds of websites and domains; thus the only way to fully disable it is via the `--disable-module` option.
|
||||
|
||||
Modules can be disabled through the command-line interface for the BDFR or, more permanently, in the configuration file via the `disabled_modules` option. The list of downloaders that can be disabled is the following; note that the names are case-insensitive. A command-line example follows the list.
|
||||
|
||||
- `Direct`
|
||||
- `Erome`
|
||||
- `Gallery` (Reddit Image Galleries)
|
||||
- `Gfycat`
|
||||
- `Imgur`
|
||||
- `Redgifs`
|
||||
- `SelfPost` (Reddit Text Post)
|
||||
- `Youtube`
|
||||
- `YoutubeDlFallback`
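For example, a command-line sketch that disables the YouTube fallback downloader for a single run (the directory and subreddit are illustrative):

```bash
python3 -m bdfr download ./downloads --subreddit gifs --disable-module YoutubeDlFallback
```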
|
||||
|
||||
### Rate Limiting
|
||||
|
||||
The option `max_wait_time` has to do with retrying downloads. There are certain HTTP errors that mean that no amount of requests will return the wanted data, but some errors are from rate-limiting. This is when a single client is making so many requests that the remote website cuts the client off to preserve the function of the site. This is a common situation when downloading many resources from the same site. It is polite and best practice to obey the website's wishes in these cases.
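The same limit can also be raised for a single run through the equivalent `--max-wait-time` flag; a sketch with arbitrary values:

```bash
python3 -m bdfr download ./downloads --subreddit EarthPorn --max-wait-time 300
```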
|
||||
@@ -277,10 +335,14 @@ The BDFR can be run in multiple instances with multiple configurations, either c
|
||||
|
||||
Running these scenarios consecutively is done easily, like any single run. Configuration files that differ may be specified with the `--config` option to switch between tokens, for example. Otherwise, almost all configuration for data sources can be specified per-run through the command line.
|
||||
|
||||
Running scenarious concurrently (at the same time) however, is more complicated. The BDFR will look to a single, static place to put the detailed log files, in a directory with the configuration file specified above. If there are multiple instances, or processes, of the BDFR running at the same time, they will all be trying to write to a single file. On Linux and other UNIX based operating systems, this will succeed, though there is a substantial risk that the logfile will be useless due to garbled and jumbled data. On Windows however, attempting this will raise an error that crashes the program as Windows forbids multiple processes from accessing the same file.
|
||||
Running scenarios concurrently (at the same time) however, is more complicated. The BDFR will look to a single, static place to put the detailed log files, in a directory with the configuration file specified above. If there are multiple instances, or processes, of the BDFR running at the same time, they will all be trying to write to a single file. On Linux and other UNIX based operating systems, this will succeed, though there is a substantial risk that the logfile will be useless due to garbled and jumbled data. On Windows however, attempting this will raise an error that crashes the program as Windows forbids multiple processes from accessing the same file.
|
||||
|
||||
The way to fix this is to use the `--log` option to manually specify where the logfile is to be stored. If the given location is unique to each instance of the BDFR, then it will run fine.
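A sketch of two concurrent instances kept apart with per-instance logfiles (paths and sources are illustrative):

```bash
python3 -m bdfr download ./archive_one --subreddit aww --log ./bdfr_one.log &
python3 -m bdfr download ./archive_two --user example_user --submitted --log ./bdfr_two.log &
```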
|
||||
|
||||
## Manipulating Logfiles
|
||||
|
||||
The logfiles that the BDFR outputs are consistent and quite detailed and in a format that is amenable to regex. To this end, a number of bash scripts have been [included here](./scripts). They show examples for how to extract successfully downloaded IDs, failed IDs, and more besides.
|
||||
|
||||
## List of currently supported sources
|
||||
|
||||
- Direct links (links leading to a file)
|
||||
|
||||
@@ -6,6 +6,7 @@ import sys
|
||||
import click
|
||||
|
||||
from bdfr.archiver import Archiver
|
||||
from bdfr.cloner import RedditCloner
|
||||
from bdfr.configuration import Configuration
|
||||
from bdfr.downloader import RedditDownloader
|
||||
|
||||
@@ -13,30 +14,55 @@ logger = logging.getLogger()
|
||||
|
||||
_common_options = [
|
||||
click.argument('directory', type=str),
|
||||
click.option('--config', type=str, default=None),
|
||||
click.option('-v', '--verbose', default=None, count=True),
|
||||
click.option('-l', '--link', multiple=True, default=None, type=str),
|
||||
click.option('-s', '--subreddit', multiple=True, default=None, type=str),
|
||||
click.option('-m', '--multireddit', multiple=True, default=None, type=str),
|
||||
click.option('-L', '--limit', default=None, type=int),
|
||||
click.option('--authenticate', is_flag=True, default=None),
|
||||
click.option('--config', type=str, default=None),
|
||||
click.option('--disable-module', multiple=True, default=None, type=str),
|
||||
click.option('--ignore-user', type=str, multiple=True, default=None),
|
||||
click.option('--include-id-file', multiple=True, default=None),
|
||||
click.option('--log', type=str, default=None),
|
||||
click.option('--submitted', is_flag=True, default=None),
|
||||
click.option('--upvoted', is_flag=True, default=None),
|
||||
click.option('--saved', is_flag=True, default=None),
|
||||
click.option('--search', default=None, type=str),
|
||||
click.option('--submitted', is_flag=True, default=None),
|
||||
click.option('--time-format', type=str, default=None),
|
||||
click.option('-u', '--user', type=str, default=None),
|
||||
click.option('--upvoted', is_flag=True, default=None),
|
||||
click.option('-L', '--limit', default=None, type=int),
|
||||
click.option('-l', '--link', multiple=True, default=None, type=str),
|
||||
click.option('-m', '--multireddit', multiple=True, default=None, type=str),
|
||||
click.option('-S', '--sort', type=click.Choice(('hot', 'top', 'new', 'controversial', 'rising', 'relevance')),
|
||||
default=None),
|
||||
click.option('-s', '--subreddit', multiple=True, default=None, type=str),
|
||||
click.option('-t', '--time', type=click.Choice(('all', 'hour', 'day', 'week', 'month', 'year')), default=None),
|
||||
click.option('-S', '--sort', type=click.Choice(('hot', 'top', 'new',
|
||||
'controversial', 'rising', 'relevance')), default=None),
|
||||
click.option('-u', '--user', type=str, multiple=True, default=None),
|
||||
click.option('-v', '--verbose', default=None, count=True),
|
||||
]
|
||||
|
||||
_downloader_options = [
|
||||
click.option('--file-scheme', default=None, type=str),
|
||||
click.option('--folder-scheme', default=None, type=str),
|
||||
click.option('--make-hard-links', is_flag=True, default=None),
|
||||
click.option('--max-wait-time', type=int, default=None),
|
||||
click.option('--no-dupes', is_flag=True, default=None),
|
||||
click.option('--search-existing', is_flag=True, default=None),
|
||||
click.option('--exclude-id', default=None, multiple=True),
|
||||
click.option('--exclude-id-file', default=None, multiple=True),
|
||||
click.option('--skip', default=None, multiple=True),
|
||||
click.option('--skip-domain', default=None, multiple=True),
|
||||
click.option('--skip-subreddit', default=None, multiple=True),
|
||||
]
|
||||
|
||||
_archiver_options = [
|
||||
click.option('--all-comments', is_flag=True, default=None),
|
||||
click.option('--comment-context', is_flag=True, default=None),
|
||||
click.option('-f', '--format', type=click.Choice(('xml', 'json', 'yaml')), default=None),
|
||||
]
|
||||
|
||||
|
||||
def _add_common_options(func):
|
||||
for opt in _common_options:
|
||||
func = opt(func)
|
||||
return func
|
||||
def _add_options(opts: list):
|
||||
def wrap(func):
|
||||
for opt in opts:
|
||||
func = opt(func)
|
||||
return func
|
||||
return wrap
|
||||
|
||||
|
||||
@click.group()
|
||||
@@ -45,18 +71,8 @@ def cli():
|
||||
|
||||
|
||||
@cli.command('download')
|
||||
@click.option('--exclude-id', default=None, multiple=True)
|
||||
@click.option('--exclude-id-file', default=None, multiple=True)
|
||||
@click.option('--file-scheme', default=None, type=str)
|
||||
@click.option('--folder-scheme', default=None, type=str)
|
||||
@click.option('--make-hard-links', is_flag=True, default=None)
|
||||
@click.option('--max-wait-time', type=int, default=None)
|
||||
@click.option('--no-dupes', is_flag=True, default=None)
|
||||
@click.option('--search-existing', is_flag=True, default=None)
|
||||
@click.option('--skip', default=None, multiple=True)
|
||||
@click.option('--skip-domain', default=None, multiple=True)
|
||||
@click.option('--skip-subreddit', default=None, multiple=True)
|
||||
@_add_common_options
|
||||
@_add_options(_common_options)
|
||||
@_add_options(_downloader_options)
|
||||
@click.pass_context
|
||||
def cli_download(context: click.Context, **_):
|
||||
config = Configuration()
|
||||
@@ -73,9 +89,8 @@ def cli_download(context: click.Context, **_):
|
||||
|
||||
|
||||
@cli.command('archive')
|
||||
@_add_common_options
|
||||
@click.option('--all-comments', is_flag=True, default=None)
|
||||
@click.option('-f', '--format', type=click.Choice(('xml', 'json', 'yaml')), default=None)
|
||||
@_add_options(_common_options)
|
||||
@_add_options(_archiver_options)
|
||||
@click.pass_context
|
||||
def cli_archive(context: click.Context, **_):
|
||||
config = Configuration()
|
||||
@@ -85,7 +100,26 @@ def cli_archive(context: click.Context, **_):
|
||||
reddit_archiver = Archiver(config)
|
||||
reddit_archiver.download()
|
||||
except Exception:
|
||||
logger.exception('Downloader exited unexpectedly')
|
||||
logger.exception('Archiver exited unexpectedly')
|
||||
raise
|
||||
else:
|
||||
logger.info('Program complete')
|
||||
|
||||
|
||||
@cli.command('clone')
|
||||
@_add_options(_common_options)
|
||||
@_add_options(_archiver_options)
|
||||
@_add_options(_downloader_options)
|
||||
@click.pass_context
|
||||
def cli_clone(context: click.Context, **_):
|
||||
config = Configuration()
|
||||
config.process_click_arguments(context)
|
||||
setup_logging(config.verbose)
|
||||
try:
|
||||
reddit_scraper = RedditCloner(config)
|
||||
reddit_scraper.download()
|
||||
except Exception:
|
||||
logger.exception('Scraper exited unexpectedly')
|
||||
raise
|
||||
else:
|
||||
logger.info('Program complete')
|
||||
|
||||
@@ -22,10 +22,12 @@ class BaseArchiveEntry(ABC):
|
||||
'id': in_comment.id,
|
||||
'score': in_comment.score,
|
||||
'subreddit': in_comment.subreddit.display_name,
|
||||
'author_flair': in_comment.author_flair_text,
|
||||
'submission': in_comment.submission.id,
|
||||
'stickied': in_comment.stickied,
|
||||
'body': in_comment.body,
|
||||
'is_submitter': in_comment.is_submitter,
|
||||
'distinguished': in_comment.distinguished,
|
||||
'created_utc': in_comment.created_utc,
|
||||
'parent_id': in_comment.parent_id,
|
||||
'replies': [],
|
||||
|
||||
@@ -35,6 +35,10 @@ class SubmissionArchiveEntry(BaseArchiveEntry):
|
||||
'link_flair_text': self.source.link_flair_text,
|
||||
'num_comments': self.source.num_comments,
|
||||
'over_18': self.source.over_18,
|
||||
'spoiler': self.source.spoiler,
|
||||
'pinned': self.source.pinned,
|
||||
'locked': self.source.locked,
|
||||
'distinguished': self.source.distinguished,
|
||||
'created_utc': self.source.created_utc,
|
||||
}
|
||||
|
||||
|
||||
@@ -14,24 +14,30 @@ from bdfr.archive_entry.base_archive_entry import BaseArchiveEntry
|
||||
from bdfr.archive_entry.comment_archive_entry import CommentArchiveEntry
|
||||
from bdfr.archive_entry.submission_archive_entry import SubmissionArchiveEntry
|
||||
from bdfr.configuration import Configuration
|
||||
from bdfr.downloader import RedditDownloader
|
||||
from bdfr.connector import RedditConnector
|
||||
from bdfr.exceptions import ArchiverError
|
||||
from bdfr.resource import Resource
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class Archiver(RedditDownloader):
|
||||
class Archiver(RedditConnector):
|
||||
def __init__(self, args: Configuration):
|
||||
super(Archiver, self).__init__(args)
|
||||
|
||||
def download(self):
|
||||
for generator in self.reddit_lists:
|
||||
for submission in generator:
|
||||
if (submission.author and submission.author.name in self.args.ignore_user) or \
|
||||
(submission.author is None and 'DELETED' in self.args.ignore_user):
|
||||
logger.debug(
|
||||
f'Submission {submission.id} in {submission.subreddit.display_name} skipped'
|
||||
f' due to {submission.author.name if submission.author else "DELETED"} being an ignored user')
|
||||
continue
|
||||
logger.debug(f'Attempting to archive submission {submission.id}')
|
||||
self._write_entry(submission)
|
||||
self.write_entry(submission)
|
||||
|
||||
def _get_submissions_from_link(self) -> list[list[praw.models.Submission]]:
|
||||
def get_submissions_from_link(self) -> list[list[praw.models.Submission]]:
|
||||
supplied_submissions = []
|
||||
for sub_id in self.args.link:
|
||||
if len(sub_id) == 6:
|
||||
@@ -42,12 +48,13 @@ class Archiver(RedditDownloader):
|
||||
supplied_submissions.append(self.reddit_instance.submission(url=sub_id))
|
||||
return [supplied_submissions]
|
||||
|
||||
def _get_user_data(self) -> list[Iterator]:
|
||||
results = super(Archiver, self)._get_user_data()
|
||||
def get_user_data(self) -> list[Iterator]:
|
||||
results = super(Archiver, self).get_user_data()
|
||||
if self.args.user and self.args.all_comments:
|
||||
sort = self._determine_sort_function()
|
||||
logger.debug(f'Retrieving comments of user {self.args.user}')
|
||||
results.append(sort(self.reddit_instance.redditor(self.args.user).comments, limit=self.args.limit))
|
||||
sort = self.determine_sort_function()
|
||||
for user in self.args.user:
|
||||
logger.debug(f'Retrieving comments of user {user}')
|
||||
results.append(sort(self.reddit_instance.redditor(user).comments, limit=self.args.limit))
|
||||
return results
|
||||
|
||||
@staticmethod
|
||||
@@ -59,7 +66,10 @@ class Archiver(RedditDownloader):
|
||||
else:
|
||||
raise ArchiverError(f'Factory failed to classify item of type {type(praw_item).__name__}')
|
||||
|
||||
def _write_entry(self, praw_item: (praw.models.Submission, praw.models.Comment)):
|
||||
def write_entry(self, praw_item: (praw.models.Submission, praw.models.Comment)):
|
||||
if self.args.comment_context and isinstance(praw_item, praw.models.Comment):
|
||||
logger.debug(f'Converting comment {praw_item.id} to submission {praw_item.submission.id}')
|
||||
praw_item = praw_item.submission
|
||||
archive_entry = self._pull_lever_entry_factory(praw_item)
|
||||
if self.args.format == 'json':
|
||||
self._write_entry_json(archive_entry)
|
||||
@@ -72,17 +82,17 @@ class Archiver(RedditDownloader):
|
||||
logger.info(f'Record for entry item {praw_item.id} written to disk')
|
||||
|
||||
def _write_entry_json(self, entry: BaseArchiveEntry):
|
||||
resource = Resource(entry.source, '', '.json')
|
||||
resource = Resource(entry.source, '', lambda: None, '.json')
|
||||
content = json.dumps(entry.compile())
|
||||
self._write_content_to_disk(resource, content)
|
||||
|
||||
def _write_entry_xml(self, entry: BaseArchiveEntry):
|
||||
resource = Resource(entry.source, '', '.xml')
|
||||
resource = Resource(entry.source, '', lambda: None, '.xml')
|
||||
content = dict2xml.dict2xml(entry.compile(), wrap='root')
|
||||
self._write_content_to_disk(resource, content)
|
||||
|
||||
def _write_entry_yaml(self, entry: BaseArchiveEntry):
|
||||
resource = Resource(entry.source, '', '.yaml')
|
||||
resource = Resource(entry.source, '', lambda: None, '.yaml')
|
||||
content = yaml.dump(entry.compile())
|
||||
self._write_content_to_disk(resource, content)
|
||||
|
||||
|
||||
21
bdfr/cloner.py
Normal file
@@ -0,0 +1,21 @@
|
||||
#!/usr/bin/env python3
|
||||
# coding=utf-8
|
||||
|
||||
import logging
|
||||
|
||||
from bdfr.archiver import Archiver
|
||||
from bdfr.configuration import Configuration
|
||||
from bdfr.downloader import RedditDownloader
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class RedditCloner(RedditDownloader, Archiver):
|
||||
def __init__(self, args: Configuration):
|
||||
super(RedditCloner, self).__init__(args)
|
||||
|
||||
def download(self):
|
||||
for generator in self.reddit_lists:
|
||||
for submission in generator:
|
||||
self._download_submission(submission)
|
||||
self.write_entry(submission)
|
||||
@@ -13,19 +13,23 @@ class Configuration(Namespace):
|
||||
self.authenticate = False
|
||||
self.config = None
|
||||
self.directory: str = '.'
|
||||
self.disable_module: list[str] = []
|
||||
self.exclude_id = []
|
||||
self.exclude_id_file = []
|
||||
self.file_scheme: str = '{REDDITOR}_{TITLE}_{POSTID}'
|
||||
self.folder_scheme: str = '{SUBREDDIT}'
|
||||
self.ignore_user = []
|
||||
self.include_id_file = []
|
||||
self.limit: Optional[int] = None
|
||||
self.link: list[str] = []
|
||||
self.log: Optional[str] = None
|
||||
self.make_hard_links = False
|
||||
self.max_wait_time = None
|
||||
self.multireddit: list[str] = []
|
||||
self.no_dupes: bool = False
|
||||
self.saved: bool = False
|
||||
self.search: Optional[str] = None
|
||||
self.search_existing: bool = False
|
||||
self.file_scheme: str = '{REDDITOR}_{TITLE}_{POSTID}'
|
||||
self.folder_scheme: str = '{SUBREDDIT}'
|
||||
self.skip: list[str] = []
|
||||
self.skip_domain: list[str] = []
|
||||
self.skip_subreddit: list[str] = []
|
||||
@@ -35,13 +39,13 @@ class Configuration(Namespace):
|
||||
self.time: str = 'all'
|
||||
self.time_format = None
|
||||
self.upvoted: bool = False
|
||||
self.user: Optional[str] = None
|
||||
self.user: list[str] = []
|
||||
self.verbose: int = 0
|
||||
self.make_hard_links = False
|
||||
|
||||
# Archiver-specific options
|
||||
self.format = 'json'
|
||||
self.all_comments = False
|
||||
self.format = 'json'
|
||||
self.comment_context: bool = False
|
||||
|
||||
def process_click_arguments(self, context: click.Context):
|
||||
for arg_key in context.params.keys():
|
||||
|
||||
424
bdfr/connector.py
Normal file
@@ -0,0 +1,424 @@
|
||||
#!/usr/bin/env python3
|
||||
# coding=utf-8
|
||||
|
||||
import configparser
|
||||
import importlib.resources
|
||||
import itertools
|
||||
import logging
|
||||
import logging.handlers
|
||||
import re
|
||||
import shutil
|
||||
import socket
|
||||
from abc import ABCMeta, abstractmethod
|
||||
from datetime import datetime
|
||||
from enum import Enum, auto
|
||||
from pathlib import Path
|
||||
from typing import Callable, Iterator
|
||||
|
||||
import appdirs
|
||||
import praw
|
||||
import praw.exceptions
|
||||
import praw.models
|
||||
import prawcore
|
||||
|
||||
from bdfr import exceptions as errors
|
||||
from bdfr.configuration import Configuration
|
||||
from bdfr.download_filter import DownloadFilter
|
||||
from bdfr.file_name_formatter import FileNameFormatter
|
||||
from bdfr.oauth2 import OAuth2Authenticator, OAuth2TokenManager
|
||||
from bdfr.site_authenticator import SiteAuthenticator
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class RedditTypes:
|
||||
class SortType(Enum):
|
||||
CONTROVERSIAL = auto()
|
||||
HOT = auto()
|
||||
NEW = auto()
|
||||
RELEVENCE = auto()
|
||||
RISING = auto()
|
||||
TOP = auto()
|
||||
|
||||
class TimeType(Enum):
|
||||
ALL = 'all'
|
||||
DAY = 'day'
|
||||
HOUR = 'hour'
|
||||
MONTH = 'month'
|
||||
WEEK = 'week'
|
||||
YEAR = 'year'
|
||||
|
||||
|
||||
class RedditConnector(metaclass=ABCMeta):
|
||||
def __init__(self, args: Configuration):
|
||||
self.args = args
|
||||
self.config_directories = appdirs.AppDirs('bdfr', 'BDFR')
|
||||
self.run_time = datetime.now().isoformat()
|
||||
self._setup_internal_objects()
|
||||
|
||||
self.reddit_lists = self.retrieve_reddit_lists()
|
||||
|
||||
def _setup_internal_objects(self):
|
||||
self.determine_directories()
|
||||
self.load_config()
|
||||
self.create_file_logger()
|
||||
|
||||
self.read_config()
|
||||
|
||||
self.parse_disabled_modules()
|
||||
|
||||
self.download_filter = self.create_download_filter()
|
||||
logger.log(9, 'Created download filter')
|
||||
self.time_filter = self.create_time_filter()
|
||||
logger.log(9, 'Created time filter')
|
||||
self.sort_filter = self.create_sort_filter()
|
||||
logger.log(9, 'Created sort filter')
|
||||
self.file_name_formatter = self.create_file_name_formatter()
|
||||
logger.log(9, 'Create file name formatter')
|
||||
|
||||
self.create_reddit_instance()
|
||||
self.args.user = list(filter(None, [self.resolve_user_name(user) for user in self.args.user]))
|
||||
|
||||
self.excluded_submission_ids = set.union(
|
||||
self.read_id_files(self.args.exclude_id_file),
|
||||
set(self.args.exclude_id),
|
||||
)
|
||||
|
||||
self.args.link = list(itertools.chain(self.args.link, self.read_id_files(self.args.include_id_file)))
|
||||
|
||||
self.master_hash_list = {}
|
||||
self.authenticator = self.create_authenticator()
|
||||
logger.log(9, 'Created site authenticator')
|
||||
|
||||
self.args.skip_subreddit = self.split_args_input(self.args.skip_subreddit)
|
||||
self.args.skip_subreddit = set([sub.lower() for sub in self.args.skip_subreddit])
|
||||
|
||||
def read_config(self):
|
||||
"""Read any cfg values that need to be processed"""
|
||||
if self.args.max_wait_time is None:
|
||||
self.args.max_wait_time = self.cfg_parser.getint('DEFAULT', 'max_wait_time', fallback=120)
|
||||
logger.debug(f'Setting maximum download wait time to {self.args.max_wait_time} seconds')
|
||||
if self.args.time_format is None:
|
||||
option = self.cfg_parser.get('DEFAULT', 'time_format', fallback='ISO')
|
||||
if re.match(r'^[\s\'\"]*$', option):
|
||||
option = 'ISO'
|
||||
logger.debug(f'Setting datetime format string to {option}')
|
||||
self.args.time_format = option
|
||||
if not self.args.disable_module:
|
||||
self.args.disable_module = [self.cfg_parser.get('DEFAULT', 'disabled_modules', fallback='')]
|
||||
# Update config on disk
|
||||
with open(self.config_location, 'w') as file:
|
||||
self.cfg_parser.write(file)
|
||||
|
||||
def parse_disabled_modules(self):
|
||||
disabled_modules = self.args.disable_module
|
||||
disabled_modules = self.split_args_input(disabled_modules)
|
||||
disabled_modules = set([name.strip().lower() for name in disabled_modules])
|
||||
self.args.disable_module = disabled_modules
|
||||
logger.debug(f'Disabling the following modules: {", ".join(self.args.disable_module)}')
|
||||
|
||||
def create_reddit_instance(self):
|
||||
if self.args.authenticate:
|
||||
logger.debug('Using authenticated Reddit instance')
|
||||
if not self.cfg_parser.has_option('DEFAULT', 'user_token'):
|
||||
logger.log(9, 'Commencing OAuth2 authentication')
|
||||
scopes = self.cfg_parser.get('DEFAULT', 'scopes', fallback='identity, history, read, save')
|
||||
scopes = OAuth2Authenticator.split_scopes(scopes)
|
||||
oauth2_authenticator = OAuth2Authenticator(
|
||||
scopes,
|
||||
self.cfg_parser.get('DEFAULT', 'client_id'),
|
||||
self.cfg_parser.get('DEFAULT', 'client_secret'),
|
||||
)
|
||||
token = oauth2_authenticator.retrieve_new_token()
|
||||
self.cfg_parser['DEFAULT']['user_token'] = token
|
||||
with open(self.config_location, 'w') as file:
|
||||
self.cfg_parser.write(file, True)
|
||||
token_manager = OAuth2TokenManager(self.cfg_parser, self.config_location)
|
||||
|
||||
self.authenticated = True
|
||||
self.reddit_instance = praw.Reddit(
|
||||
client_id=self.cfg_parser.get('DEFAULT', 'client_id'),
|
||||
client_secret=self.cfg_parser.get('DEFAULT', 'client_secret'),
|
||||
user_agent=socket.gethostname(),
|
||||
token_manager=token_manager,
|
||||
)
|
||||
else:
|
||||
logger.debug('Using unauthenticated Reddit instance')
|
||||
self.authenticated = False
|
||||
self.reddit_instance = praw.Reddit(
|
||||
client_id=self.cfg_parser.get('DEFAULT', 'client_id'),
|
||||
client_secret=self.cfg_parser.get('DEFAULT', 'client_secret'),
|
||||
user_agent=socket.gethostname(),
|
||||
)
|
||||
|
||||
def retrieve_reddit_lists(self) -> list[praw.models.ListingGenerator]:
|
||||
master_list = []
|
||||
master_list.extend(self.get_subreddits())
|
||||
logger.log(9, 'Retrieved subreddits')
|
||||
master_list.extend(self.get_multireddits())
|
||||
logger.log(9, 'Retrieved multireddits')
|
||||
master_list.extend(self.get_user_data())
|
||||
logger.log(9, 'Retrieved user data')
|
||||
master_list.extend(self.get_submissions_from_link())
|
||||
logger.log(9, 'Retrieved submissions for given links')
|
||||
return master_list
|
||||
|
||||
def determine_directories(self):
|
||||
self.download_directory = Path(self.args.directory).resolve().expanduser()
|
||||
self.config_directory = Path(self.config_directories.user_config_dir)
|
||||
|
||||
self.download_directory.mkdir(exist_ok=True, parents=True)
|
||||
self.config_directory.mkdir(exist_ok=True, parents=True)
|
||||
|
||||
def load_config(self):
|
||||
self.cfg_parser = configparser.ConfigParser()
|
||||
if self.args.config:
|
||||
if (cfg_path := Path(self.args.config)).exists():
|
||||
self.cfg_parser.read(cfg_path)
|
||||
self.config_location = cfg_path
|
||||
return
|
||||
possible_paths = [
|
||||
Path('./config.cfg'),
|
||||
Path('./default_config.cfg'),
|
||||
Path(self.config_directory, 'config.cfg'),
|
||||
Path(self.config_directory, 'default_config.cfg'),
|
||||
]
|
||||
self.config_location = None
|
||||
for path in possible_paths:
|
||||
if path.resolve().expanduser().exists():
|
||||
self.config_location = path
|
||||
logger.debug(f'Loading configuration from {path}')
|
||||
break
|
||||
if not self.config_location:
|
||||
with importlib.resources.path('bdfr', 'default_config.cfg') as path:
|
||||
self.config_location = path
|
||||
shutil.copy(self.config_location, Path(self.config_directory, 'default_config.cfg'))
|
||||
if not self.config_location:
|
||||
raise errors.BulkDownloaderException('Could not find a configuration file to load')
|
||||
self.cfg_parser.read(self.config_location)
|
||||
|
||||
def create_file_logger(self):
|
||||
main_logger = logging.getLogger()
|
||||
if self.args.log is None:
|
||||
log_path = Path(self.config_directory, 'log_output.txt')
|
||||
else:
|
||||
log_path = Path(self.args.log).resolve().expanduser()
|
||||
if not log_path.parent.exists():
|
||||
raise errors.BulkDownloaderException(f'Designated location for logfile does not exist')
|
||||
backup_count = self.cfg_parser.getint('DEFAULT', 'backup_log_count', fallback=3)
|
||||
file_handler = logging.handlers.RotatingFileHandler(
|
||||
log_path,
|
||||
mode='a',
|
||||
backupCount=backup_count,
|
||||
)
|
||||
if log_path.exists():
|
||||
try:
|
||||
file_handler.doRollover()
|
||||
except PermissionError:
|
||||
logger.critical(
|
||||
'Cannot rollover logfile, make sure this is the only '
|
||||
'BDFR process or specify alternate logfile location')
|
||||
raise
|
||||
formatter = logging.Formatter('[%(asctime)s - %(name)s - %(levelname)s] - %(message)s')
|
||||
file_handler.setFormatter(formatter)
|
||||
file_handler.setLevel(0)
|
||||
|
||||
main_logger.addHandler(file_handler)
|
||||
|
||||
@staticmethod
|
||||
def sanitise_subreddit_name(subreddit: str) -> str:
|
||||
pattern = re.compile(r'^(?:https://www\.reddit\.com/)?(?:r/)?(.*?)/?$')
|
||||
match = re.match(pattern, subreddit)
|
||||
if not match:
|
||||
raise errors.BulkDownloaderException(f'Could not find subreddit name in string {subreddit}')
|
||||
return match.group(1)
|
||||
|
||||
@staticmethod
|
||||
def split_args_input(entries: list[str]) -> set[str]:
|
||||
all_entries = []
|
||||
split_pattern = re.compile(r'[,;]\s?')
|
||||
for entry in entries:
|
||||
results = re.split(split_pattern, entry)
|
||||
all_entries.extend([RedditConnector.sanitise_subreddit_name(name) for name in results])
|
||||
return set(all_entries)
|
||||
|
||||
def get_subreddits(self) -> list[praw.models.ListingGenerator]:
|
||||
if self.args.subreddit:
|
||||
out = []
|
||||
for reddit in self.split_args_input(self.args.subreddit):
|
||||
if reddit == 'friends' and self.authenticated is False:
|
||||
logger.error('Cannot read friends subreddit without an authenticated instance')
|
||||
continue
|
||||
try:
|
||||
reddit = self.reddit_instance.subreddit(reddit)
|
||||
try:
|
||||
self.check_subreddit_status(reddit)
|
||||
except errors.BulkDownloaderException as e:
|
||||
logger.error(e)
|
||||
continue
|
||||
if self.args.search:
|
||||
out.append(reddit.search(
|
||||
self.args.search,
|
||||
sort=self.sort_filter.name.lower(),
|
||||
limit=self.args.limit,
|
||||
time_filter=self.time_filter.value,
|
||||
))
|
||||
logger.debug(
|
||||
f'Added submissions from subreddit {reddit} with the search term "{self.args.search}"')
|
||||
else:
|
||||
out.append(self.create_filtered_listing_generator(reddit))
|
||||
logger.debug(f'Added submissions from subreddit {reddit}')
|
||||
except (errors.BulkDownloaderException, praw.exceptions.PRAWException) as e:
|
||||
logger.error(f'Failed to get submissions for subreddit {reddit}: {e}')
|
||||
return out
|
||||
else:
|
||||
return []
|
||||
|
||||
def resolve_user_name(self, in_name: str) -> str:
|
||||
if in_name == 'me':
|
||||
if self.authenticated:
|
||||
resolved_name = self.reddit_instance.user.me().name
|
||||
logger.log(9, f'Resolved user to {resolved_name}')
|
||||
return resolved_name
|
||||
else:
|
||||
logger.warning('To use "me" as a user, an authenticated Reddit instance must be used')
|
||||
else:
|
||||
return in_name
|
||||
|
||||
def get_submissions_from_link(self) -> list[list[praw.models.Submission]]:
|
||||
supplied_submissions = []
|
||||
for sub_id in self.args.link:
|
||||
if len(sub_id) == 6:
|
||||
supplied_submissions.append(self.reddit_instance.submission(id=sub_id))
|
||||
else:
|
||||
supplied_submissions.append(self.reddit_instance.submission(url=sub_id))
|
||||
return [supplied_submissions]
|
||||
|
||||
def determine_sort_function(self) -> Callable:
|
||||
if self.sort_filter is RedditTypes.SortType.NEW:
|
||||
sort_function = praw.models.Subreddit.new
|
||||
elif self.sort_filter is RedditTypes.SortType.RISING:
|
||||
sort_function = praw.models.Subreddit.rising
|
||||
elif self.sort_filter is RedditTypes.SortType.CONTROVERSIAL:
|
||||
sort_function = praw.models.Subreddit.controversial
|
||||
elif self.sort_filter is RedditTypes.SortType.TOP:
|
||||
sort_function = praw.models.Subreddit.top
|
||||
else:
|
||||
sort_function = praw.models.Subreddit.hot
|
||||
return sort_function
|
||||
|
||||
def get_multireddits(self) -> list[Iterator]:
|
||||
if self.args.multireddit:
|
||||
if len(self.args.user) != 1:
|
||||
logger.error(f'Only 1 user can be supplied when retrieving from multireddits')
|
||||
return []
|
||||
out = []
|
||||
for multi in self.split_args_input(self.args.multireddit):
|
||||
try:
|
||||
multi = self.reddit_instance.multireddit(self.args.user[0], multi)
|
||||
if not multi.subreddits:
|
||||
raise errors.BulkDownloaderException
|
||||
out.append(self.create_filtered_listing_generator(multi))
|
||||
logger.debug(f'Added submissions from multireddit {multi}')
|
||||
except (errors.BulkDownloaderException, praw.exceptions.PRAWException, prawcore.PrawcoreException) as e:
|
||||
logger.error(f'Failed to get submissions for multireddit {multi}: {e}')
|
||||
return out
|
||||
else:
|
||||
return []
|
||||
|
||||
def create_filtered_listing_generator(self, reddit_source) -> Iterator:
|
||||
sort_function = self.determine_sort_function()
|
||||
if self.sort_filter in (RedditTypes.SortType.TOP, RedditTypes.SortType.CONTROVERSIAL):
|
||||
return sort_function(reddit_source, limit=self.args.limit, time_filter=self.time_filter.value)
|
||||
else:
|
||||
return sort_function(reddit_source, limit=self.args.limit)
|
||||
|
||||
def get_user_data(self) -> list[Iterator]:
|
||||
if any([self.args.submitted, self.args.upvoted, self.args.saved]):
|
||||
if not self.args.user:
|
||||
logger.warning('At least one user must be supplied to download user data')
|
||||
return []
|
||||
generators = []
|
||||
for user in self.args.user:
|
||||
try:
|
||||
self.check_user_existence(user)
|
||||
except errors.BulkDownloaderException as e:
|
||||
logger.error(e)
|
||||
continue
|
||||
if self.args.submitted:
|
||||
logger.debug(f'Retrieving submitted posts of user {self.args.user}')
|
||||
generators.append(self.create_filtered_listing_generator(
|
||||
self.reddit_instance.redditor(user).submissions,
|
||||
))
|
||||
if not self.authenticated and any((self.args.upvoted, self.args.saved)):
|
||||
logger.warning('Accessing user lists requires authentication')
|
||||
else:
|
||||
if self.args.upvoted:
|
||||
logger.debug(f'Retrieving upvoted posts of user {self.args.user}')
|
||||
generators.append(self.reddit_instance.redditor(user).upvoted(limit=self.args.limit))
|
||||
if self.args.saved:
|
||||
logger.debug(f'Retrieving saved posts of user {self.args.user}')
|
||||
generators.append(self.reddit_instance.redditor(user).saved(limit=self.args.limit))
|
||||
return generators
|
||||
else:
|
||||
return []
|
||||
|
||||
def check_user_existence(self, name: str):
|
||||
user = self.reddit_instance.redditor(name=name)
|
||||
try:
|
||||
if user.id:
|
||||
return
|
||||
except prawcore.exceptions.NotFound:
|
||||
raise errors.BulkDownloaderException(f'Could not find user {name}')
|
||||
except AttributeError:
|
||||
if hasattr(user, 'is_suspended'):
|
||||
raise errors.BulkDownloaderException(f'User {name} is banned')
|
||||
|
||||
def create_file_name_formatter(self) -> FileNameFormatter:
|
||||
return FileNameFormatter(self.args.file_scheme, self.args.folder_scheme, self.args.time_format)
|
||||
|
||||
def create_time_filter(self) -> RedditTypes.TimeType:
|
||||
try:
|
||||
return RedditTypes.TimeType[self.args.time.upper()]
|
||||
except (KeyError, AttributeError):
|
||||
return RedditTypes.TimeType.ALL
|
||||
|
||||
def create_sort_filter(self) -> RedditTypes.SortType:
|
||||
try:
|
||||
return RedditTypes.SortType[self.args.sort.upper()]
|
||||
except (KeyError, AttributeError):
|
||||
return RedditTypes.SortType.HOT
|
||||
|
||||
def create_download_filter(self) -> DownloadFilter:
|
||||
return DownloadFilter(self.args.skip, self.args.skip_domain)
|
||||
|
||||
def create_authenticator(self) -> SiteAuthenticator:
|
||||
return SiteAuthenticator(self.cfg_parser)
|
||||
|
||||
@abstractmethod
|
||||
def download(self):
|
||||
pass
|
||||
|
||||
@staticmethod
|
||||
def check_subreddit_status(subreddit: praw.models.Subreddit):
|
||||
if subreddit.display_name in ('all', 'friends'):
|
||||
return
|
||||
try:
|
||||
assert subreddit.id
|
||||
except prawcore.NotFound:
|
||||
raise errors.BulkDownloaderException(f'Source {subreddit.display_name} does not exist or cannot be found')
|
||||
except prawcore.Forbidden:
|
||||
raise errors.BulkDownloaderException(f'Source {subreddit.display_name} is private and cannot be scraped')
|
||||
|
||||
@staticmethod
|
||||
def read_id_files(file_locations: list[str]) -> set[str]:
|
||||
out = []
|
||||
for id_file in file_locations:
|
||||
id_file = Path(id_file).resolve().expanduser()
|
||||
if not id_file.exists():
|
||||
logger.warning(f'ID file at {id_file} does not exist')
|
||||
continue
|
||||
with open(id_file, 'r') as file:
|
||||
for line in file:
|
||||
out.append(line.strip())
|
||||
return set(out)
|
||||
@@ -1,405 +1,70 @@
|
||||
#!/usr/bin/env python3
|
||||
# coding=utf-8
|
||||
|
||||
import configparser
|
||||
import hashlib
|
||||
import importlib.resources
|
||||
import logging
|
||||
import logging.handlers
|
||||
import os
|
||||
import re
|
||||
import shutil
|
||||
import socket
|
||||
import time
|
||||
from datetime import datetime
|
||||
from enum import Enum, auto
|
||||
from multiprocessing import Pool
|
||||
from pathlib import Path
|
||||
from typing import Callable, Iterator
|
||||
|
||||
import appdirs
|
||||
import praw
|
||||
import praw.exceptions
|
||||
import praw.models
|
||||
import prawcore
|
||||
|
||||
import bdfr.exceptions as errors
|
||||
from bdfr import exceptions as errors
|
||||
from bdfr.configuration import Configuration
|
||||
from bdfr.download_filter import DownloadFilter
|
||||
from bdfr.file_name_formatter import FileNameFormatter
|
||||
from bdfr.oauth2 import OAuth2Authenticator, OAuth2TokenManager
|
||||
from bdfr.site_authenticator import SiteAuthenticator
|
||||
from bdfr.connector import RedditConnector
|
||||
from bdfr.site_downloaders.download_factory import DownloadFactory
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
def _calc_hash(existing_file: Path):
|
||||
chunk_size = 1024 * 1024
|
||||
md5_hash = hashlib.md5()
|
||||
with open(existing_file, 'rb') as file:
|
||||
file_hash = hashlib.md5(file.read()).hexdigest()
|
||||
return existing_file, file_hash
|
||||
chunk = file.read(chunk_size)
|
||||
while chunk:
|
||||
md5_hash.update(chunk)
|
||||
chunk = file.read(chunk_size)
|
||||
file_hash = md5_hash.hexdigest()
|
||||
return existing_file, file_hash
|
||||
|
||||
|
||||
class RedditTypes:
|
||||
class SortType(Enum):
|
||||
CONTROVERSIAL = auto()
|
||||
HOT = auto()
|
||||
NEW = auto()
|
||||
RELEVENCE = auto()
|
||||
RISING = auto()
|
||||
TOP = auto()
|
||||
|
||||
class TimeType(Enum):
|
||||
ALL = 'all'
|
||||
DAY = 'day'
|
||||
HOUR = 'hour'
|
||||
MONTH = 'month'
|
||||
WEEK = 'week'
|
||||
YEAR = 'year'
|
||||
|
||||
|
||||
class RedditDownloader:
|
||||
class RedditDownloader(RedditConnector):
|
||||
def __init__(self, args: Configuration):
|
||||
self.args = args
|
||||
self.config_directories = appdirs.AppDirs('bdfr', 'BDFR')
|
||||
self.run_time = datetime.now().isoformat()
|
||||
self._setup_internal_objects()
|
||||
|
||||
self.reddit_lists = self._retrieve_reddit_lists()
|
||||
|
||||
def _setup_internal_objects(self):
|
||||
self._determine_directories()
|
||||
self._load_config()
|
||||
self._create_file_logger()
|
||||
|
||||
self._read_config()
|
||||
|
||||
self.download_filter = self._create_download_filter()
|
||||
logger.log(9, 'Created download filter')
|
||||
self.time_filter = self._create_time_filter()
|
||||
logger.log(9, 'Created time filter')
|
||||
self.sort_filter = self._create_sort_filter()
|
||||
logger.log(9, 'Created sort filter')
|
||||
self.file_name_formatter = self._create_file_name_formatter()
|
||||
logger.log(9, 'Create file name formatter')
|
||||
|
||||
self._create_reddit_instance()
|
||||
self._resolve_user_name()
|
||||
|
||||
self.excluded_submission_ids = self._read_excluded_ids()
|
||||
|
||||
super(RedditDownloader, self).__init__(args)
|
||||
if self.args.search_existing:
|
||||
self.master_hash_list = self.scan_existing_files(self.download_directory)
|
||||
else:
|
||||
self.master_hash_list = {}
|
||||
self.authenticator = self._create_authenticator()
|
||||
logger.log(9, 'Created site authenticator')
|
||||
|
||||
self.args.skip_subreddit = self._split_args_input(self.args.skip_subreddit)
|
||||
self.args.skip_subreddit = set([sub.lower() for sub in self.args.skip_subreddit])
|
||||
|
||||
def _read_config(self):
|
||||
"""Read any cfg values that need to be processed"""
|
||||
if self.args.max_wait_time is None:
|
||||
if not self.cfg_parser.has_option('DEFAULT', 'max_wait_time'):
|
||||
self.cfg_parser.set('DEFAULT', 'max_wait_time', '120')
|
||||
logger.log(9, 'Wrote default download wait time download to config file')
|
||||
self.args.max_wait_time = self.cfg_parser.getint('DEFAULT', 'max_wait_time')
|
||||
logger.debug(f'Setting maximum download wait time to {self.args.max_wait_time} seconds')
|
||||
if self.args.time_format is None:
|
||||
option = self.cfg_parser.get('DEFAULT', 'time_format', fallback='ISO')
|
||||
if re.match(r'^[ \'\"]*$', option):
|
||||
option = 'ISO'
|
||||
logger.debug(f'Setting datetime format string to {option}')
|
||||
self.args.time_format = option
|
||||
# Update config on disk
|
||||
with open(self.config_location, 'w') as file:
|
||||
self.cfg_parser.write(file)
|
||||
|
||||
def _create_reddit_instance(self):
|
||||
if self.args.authenticate:
|
||||
logger.debug('Using authenticated Reddit instance')
|
||||
if not self.cfg_parser.has_option('DEFAULT', 'user_token'):
|
||||
logger.log(9, 'Commencing OAuth2 authentication')
|
||||
scopes = self.cfg_parser.get('DEFAULT', 'scopes')
|
||||
scopes = OAuth2Authenticator.split_scopes(scopes)
|
||||
oauth2_authenticator = OAuth2Authenticator(
|
||||
scopes,
|
||||
self.cfg_parser.get('DEFAULT', 'client_id'),
|
||||
self.cfg_parser.get('DEFAULT', 'client_secret'),
|
||||
)
|
||||
token = oauth2_authenticator.retrieve_new_token()
|
||||
self.cfg_parser['DEFAULT']['user_token'] = token
|
||||
with open(self.config_location, 'w') as file:
|
||||
self.cfg_parser.write(file, True)
|
||||
token_manager = OAuth2TokenManager(self.cfg_parser, self.config_location)
|
||||
|
||||
self.authenticated = True
|
||||
self.reddit_instance = praw.Reddit(
|
||||
client_id=self.cfg_parser.get('DEFAULT', 'client_id'),
|
||||
client_secret=self.cfg_parser.get('DEFAULT', 'client_secret'),
|
||||
user_agent=socket.gethostname(),
|
||||
token_manager=token_manager,
|
||||
)
|
||||
else:
|
||||
logger.debug('Using unauthenticated Reddit instance')
|
||||
self.authenticated = False
|
||||
self.reddit_instance = praw.Reddit(
|
||||
client_id=self.cfg_parser.get('DEFAULT', 'client_id'),
|
||||
client_secret=self.cfg_parser.get('DEFAULT', 'client_secret'),
|
||||
user_agent=socket.gethostname(),
|
||||
)
|
||||
|
||||
def _retrieve_reddit_lists(self) -> list[praw.models.ListingGenerator]:
|
||||
master_list = []
|
||||
master_list.extend(self._get_subreddits())
|
||||
logger.log(9, 'Retrieved subreddits')
|
||||
master_list.extend(self._get_multireddits())
|
||||
logger.log(9, 'Retrieved multireddits')
|
||||
master_list.extend(self._get_user_data())
|
||||
logger.log(9, 'Retrieved user data')
|
||||
master_list.extend(self._get_submissions_from_link())
|
||||
logger.log(9, 'Retrieved submissions for given links')
|
||||
return master_list
|
||||
|
||||
def _determine_directories(self):
|
||||
self.download_directory = Path(self.args.directory).resolve().expanduser()
|
||||
self.config_directory = Path(self.config_directories.user_config_dir)
|
||||
|
||||
self.download_directory.mkdir(exist_ok=True, parents=True)
|
||||
self.config_directory.mkdir(exist_ok=True, parents=True)
|
||||
|
||||
def _load_config(self):
|
||||
self.cfg_parser = configparser.ConfigParser()
|
||||
if self.args.config:
|
||||
if (cfg_path := Path(self.args.config)).exists():
|
||||
self.cfg_parser.read(cfg_path)
|
||||
self.config_location = cfg_path
|
||||
return
|
||||
possible_paths = [
|
||||
Path('./config.cfg'),
|
||||
Path('./default_config.cfg'),
|
||||
Path(self.config_directory, 'config.cfg'),
|
||||
Path(self.config_directory, 'default_config.cfg'),
|
||||
]
|
||||
self.config_location = None
|
||||
for path in possible_paths:
|
||||
if path.resolve().expanduser().exists():
|
||||
self.config_location = path
|
||||
logger.debug(f'Loading configuration from {path}')
|
||||
break
|
||||
if not self.config_location:
|
||||
self.config_location = list(importlib.resources.path('bdfr', 'default_config.cfg').gen)[0]
|
||||
shutil.copy(self.config_location, Path(self.config_directory, 'default_config.cfg'))
|
||||
if not self.config_location:
|
||||
raise errors.BulkDownloaderException('Could not find a configuration file to load')
|
||||
self.cfg_parser.read(self.config_location)
|
||||
|
||||
def _create_file_logger(self):
|
||||
main_logger = logging.getLogger()
|
||||
if self.args.log is None:
|
||||
log_path = Path(self.config_directory, 'log_output.txt')
|
||||
else:
|
||||
log_path = Path(self.args.log).resolve().expanduser()
|
||||
if not log_path.parent.exists():
|
||||
raise errors.BulkDownloaderException(f'Designated location for logfile does not exist')
|
||||
backup_count = self.cfg_parser.getint('DEFAULT', 'backup_log_count', fallback=3)
|
||||
file_handler = logging.handlers.RotatingFileHandler(
|
||||
log_path,
|
||||
mode='a',
|
||||
backupCount=backup_count,
|
||||
)
|
||||
if log_path.exists():
|
||||
try:
|
||||
file_handler.doRollover()
|
||||
except PermissionError as e:
|
||||
logger.critical(
|
||||
'Cannot rollover logfile, make sure this is the only '
|
||||
'BDFR process or specify alternate logfile location')
|
||||
raise
|
||||
formatter = logging.Formatter('[%(asctime)s - %(name)s - %(levelname)s] - %(message)s')
|
||||
file_handler.setFormatter(formatter)
|
||||
file_handler.setLevel(0)
|
||||
|
||||
main_logger.addHandler(file_handler)
|
||||
|
||||
@staticmethod
|
||||
def _sanitise_subreddit_name(subreddit: str) -> str:
|
||||
pattern = re.compile(r'^(?:https://www\.reddit\.com/)?(?:r/)?(.*?)/?$')
|
||||
match = re.match(pattern, subreddit)
|
||||
if not match:
|
||||
raise errors.BulkDownloaderException(f'Could not find subreddit name in string {subreddit}')
|
||||
return match.group(1)
|
||||
|
||||
@staticmethod
|
||||
def _split_args_input(entries: list[str]) -> set[str]:
|
||||
all_entries = []
|
||||
split_pattern = re.compile(r'[,;]\s?')
|
||||
for entry in entries:
|
||||
results = re.split(split_pattern, entry)
|
||||
all_entries.extend([RedditDownloader._sanitise_subreddit_name(name) for name in results])
|
||||
return set(all_entries)
|
||||
|
||||
def _get_subreddits(self) -> list[praw.models.ListingGenerator]:
|
||||
if self.args.subreddit:
|
||||
out = []
|
||||
for reddit in self._split_args_input(self.args.subreddit):
|
||||
try:
|
||||
reddit = self.reddit_instance.subreddit(reddit)
|
||||
try:
|
||||
self._check_subreddit_status(reddit)
|
||||
except errors.BulkDownloaderException as e:
|
||||
logger.error(e)
|
||||
continue
|
||||
if self.args.search:
|
||||
out.append(reddit.search(
|
||||
self.args.search,
|
||||
sort=self.sort_filter.name.lower(),
|
||||
limit=self.args.limit,
|
||||
time_filter=self.time_filter.value,
|
||||
))
|
||||
logger.debug(
|
||||
f'Added submissions from subreddit {reddit} with the search term "{self.args.search}"')
|
||||
else:
|
||||
out.append(self._create_filtered_listing_generator(reddit))
|
||||
logger.debug(f'Added submissions from subreddit {reddit}')
|
||||
except (errors.BulkDownloaderException, praw.exceptions.PRAWException) as e:
|
||||
logger.error(f'Failed to get submissions for subreddit {reddit}: {e}')
|
||||
return out
|
||||
else:
|
||||
return []
|
||||
|
||||
def _resolve_user_name(self):
|
||||
if self.args.user == 'me':
|
||||
if self.authenticated:
|
||||
self.args.user = self.reddit_instance.user.me().name
|
||||
logger.log(9, f'Resolved user to {self.args.user}')
|
||||
else:
|
||||
self.args.user = None
|
||||
logger.warning('To use "me" as a user, an authenticated Reddit instance must be used')
|
||||
|
||||
def _get_submissions_from_link(self) -> list[list[praw.models.Submission]]:
|
||||
supplied_submissions = []
|
||||
for sub_id in self.args.link:
|
||||
if len(sub_id) == 6:
|
||||
supplied_submissions.append(self.reddit_instance.submission(id=sub_id))
|
||||
else:
|
||||
supplied_submissions.append(self.reddit_instance.submission(url=sub_id))
|
||||
return [supplied_submissions]
|
||||
|
||||
def _determine_sort_function(self) -> Callable:
|
||||
if self.sort_filter is RedditTypes.SortType.NEW:
|
||||
sort_function = praw.models.Subreddit.new
|
||||
elif self.sort_filter is RedditTypes.SortType.RISING:
|
||||
sort_function = praw.models.Subreddit.rising
|
||||
elif self.sort_filter is RedditTypes.SortType.CONTROVERSIAL:
|
||||
sort_function = praw.models.Subreddit.controversial
|
||||
elif self.sort_filter is RedditTypes.SortType.TOP:
|
||||
sort_function = praw.models.Subreddit.top
|
||||
else:
|
||||
sort_function = praw.models.Subreddit.hot
|
||||
return sort_function
|
||||
|
||||
def _get_multireddits(self) -> list[Iterator]:
|
||||
if self.args.multireddit:
|
||||
out = []
|
||||
for multi in self._split_args_input(self.args.multireddit):
|
||||
try:
|
||||
multi = self.reddit_instance.multireddit(self.args.user, multi)
|
||||
if not multi.subreddits:
|
||||
raise errors.BulkDownloaderException
|
||||
out.append(self._create_filtered_listing_generator(multi))
|
||||
logger.debug(f'Added submissions from multireddit {multi}')
|
||||
except (errors.BulkDownloaderException, praw.exceptions.PRAWException, prawcore.PrawcoreException) as e:
|
||||
logger.error(f'Failed to get submissions for multireddit {multi}: {e}')
|
||||
return out
|
||||
else:
|
||||
return []
|
||||
|
||||
def _create_filtered_listing_generator(self, reddit_source) -> Iterator:
|
||||
sort_function = self._determine_sort_function()
|
||||
if self.sort_filter in (RedditTypes.SortType.TOP, RedditTypes.SortType.CONTROVERSIAL):
|
||||
return sort_function(reddit_source, limit=self.args.limit, time_filter=self.time_filter.value)
|
||||
else:
|
||||
return sort_function(reddit_source, limit=self.args.limit)
|
||||
|
||||
def _get_user_data(self) -> list[Iterator]:
|
||||
if any([self.args.submitted, self.args.upvoted, self.args.saved]):
|
||||
if self.args.user:
|
||||
try:
|
||||
self._check_user_existence(self.args.user)
|
||||
except errors.BulkDownloaderException as e:
|
||||
logger.error(e)
|
||||
return []
|
||||
generators = []
|
||||
if self.args.submitted:
|
||||
logger.debug(f'Retrieving submitted posts of user {self.args.user}')
|
||||
generators.append(self._create_filtered_listing_generator(
|
||||
self.reddit_instance.redditor(self.args.user).submissions,
|
||||
))
|
||||
if not self.authenticated and any((self.args.upvoted, self.args.saved)):
|
||||
logger.warning('Accessing user lists requires authentication')
|
||||
else:
|
||||
if self.args.upvoted:
|
||||
logger.debug(f'Retrieving upvoted posts of user {self.args.user}')
|
||||
generators.append(self.reddit_instance.redditor(self.args.user).upvoted(limit=self.args.limit))
|
||||
if self.args.saved:
|
||||
logger.debug(f'Retrieving saved posts of user {self.args.user}')
|
||||
generators.append(self.reddit_instance.redditor(self.args.user).saved(limit=self.args.limit))
|
||||
return generators
|
||||
else:
|
||||
logger.warning('A user must be supplied to download user data')
|
||||
return []
|
||||
else:
|
||||
return []
|
||||
|
||||
def _check_user_existence(self, name: str):
|
||||
user = self.reddit_instance.redditor(name=name)
|
||||
try:
|
||||
if user.id:
|
||||
return
|
||||
except prawcore.exceptions.NotFound:
|
||||
raise errors.BulkDownloaderException(f'Could not find user {name}')
|
||||
except AttributeError:
|
||||
if hasattr(user, 'is_suspended'):
|
||||
raise errors.BulkDownloaderException(f'User {name} is banned')
|
||||
|
||||
def _create_file_name_formatter(self) -> FileNameFormatter:
|
||||
return FileNameFormatter(self.args.file_scheme, self.args.folder_scheme, self.args.time_format)
|
||||
|
||||
def _create_time_filter(self) -> RedditTypes.TimeType:
|
||||
try:
|
||||
return RedditTypes.TimeType[self.args.time.upper()]
|
||||
except (KeyError, AttributeError):
|
||||
return RedditTypes.TimeType.ALL
|
||||
|
||||
def _create_sort_filter(self) -> RedditTypes.SortType:
|
||||
try:
|
||||
return RedditTypes.SortType[self.args.sort.upper()]
|
||||
except (KeyError, AttributeError):
|
||||
return RedditTypes.SortType.HOT
|
||||
|
||||
def _create_download_filter(self) -> DownloadFilter:
|
||||
return DownloadFilter(self.args.skip, self.args.skip_domain)
|
||||
|
||||
def _create_authenticator(self) -> SiteAuthenticator:
|
||||
return SiteAuthenticator(self.cfg_parser)
|
||||
|
||||
def download(self):
|
||||
for generator in self.reddit_lists:
|
||||
for submission in generator:
|
||||
if submission.id in self.excluded_submission_ids:
|
||||
logger.debug(f'Object {submission.id} in exclusion list, skipping')
|
||||
continue
|
||||
elif submission.subreddit.display_name.lower() in self.args.skip_subreddit:
|
||||
logger.debug(f'Submission {submission.id} in {submission.subreddit.display_name} in skip list')
|
||||
else:
|
||||
logger.debug(f'Attempting to download submission {submission.id}')
|
||||
self._download_submission(submission)
|
||||
self._download_submission(submission)
|
||||
|
||||
def _download_submission(self, submission: praw.models.Submission):
|
||||
if not isinstance(submission, praw.models.Submission):
|
||||
if submission.id in self.excluded_submission_ids:
|
||||
logger.debug(f'Object {submission.id} in exclusion list, skipping')
|
||||
return
|
||||
elif submission.subreddit.display_name.lower() in self.args.skip_subreddit:
|
||||
logger.debug(f'Submission {submission.id} in {submission.subreddit.display_name} in skip list')
|
||||
return
|
||||
elif (submission.author and submission.author.name in self.args.ignore_user) or \
|
||||
(submission.author is None and 'DELETED' in self.args.ignore_user):
|
||||
logger.debug(
|
||||
f'Submission {submission.id} in {submission.subreddit.display_name} skipped'
|
||||
f' due to {submission.author.name if submission.author else "DELETED"} being an ignored user')
|
||||
return
|
||||
elif not isinstance(submission, praw.models.Submission):
|
||||
logger.warning(f'{submission.id} is not a submission')
|
||||
return
|
||||
elif not self.download_filter.check_url(submission.url):
|
||||
logger.debug(f'Submission {submission.id} filtered due to URL {submission.url}')
|
||||
return
|
||||
|
||||
logger.debug(f'Attempting to download submission {submission.id}')
|
||||
try:
|
||||
downloader_class = DownloadFactory.pull_lever(submission.url)
|
||||
downloader = downloader_class(submission)
|
||||
@@ -407,7 +72,9 @@ class RedditDownloader:
|
||||
except errors.NotADownloadableLinkError as e:
|
||||
logger.error(f'Could not download submission {submission.id}: {e}')
|
||||
return
|
||||
|
||||
if downloader_class.__name__.lower() in self.args.disable_module:
|
||||
logger.debug(f'Submission {submission.id} skipped due to disabled module {downloader_class.__name__}')
|
||||
return
|
||||
try:
|
||||
content = downloader.find_resources(self.authenticator)
|
||||
except errors.SiteDownloaderError as e:
|
||||
@@ -415,34 +82,43 @@ class RedditDownloader:
|
||||
return
|
||||
for destination, res in self.file_name_formatter.format_resource_paths(content, self.download_directory):
|
||||
if destination.exists():
|
||||
logger.debug(f'File {destination} already exists, continuing')
|
||||
logger.debug(f'File {destination} from submission {submission.id} already exists, continuing')
|
||||
continue
|
||||
elif not self.download_filter.check_resource(res):
|
||||
logger.debug(f'Download filter removed {submission.id} with URL {submission.url}')
|
||||
else:
|
||||
try:
|
||||
res.download(self.args.max_wait_time)
|
||||
except errors.BulkDownloaderException as e:
|
||||
logger.error(f'Failed to download resource {res.url} in submission {submission.id} '
|
||||
f'with downloader {downloader_class.__name__}: {e}')
|
||||
logger.debug(f'Download filter removed {submission.id} file with URL {submission.url}')
|
||||
continue
|
||||
try:
|
||||
res.download({'max_wait_time': self.args.max_wait_time})
|
||||
except errors.BulkDownloaderException as e:
|
||||
logger.error(f'Failed to download resource {res.url} in submission {submission.id} '
|
||||
f'with downloader {downloader_class.__name__}: {e}')
|
||||
return
|
||||
resource_hash = res.hash.hexdigest()
|
||||
destination.parent.mkdir(parents=True, exist_ok=True)
|
||||
if resource_hash in self.master_hash_list:
|
||||
if self.args.no_dupes:
|
||||
logger.info(
|
||||
f'Resource hash {resource_hash} from submission {submission.id} downloaded elsewhere')
|
||||
return
|
||||
resource_hash = res.hash.hexdigest()
|
||||
destination.parent.mkdir(parents=True, exist_ok=True)
|
||||
if resource_hash in self.master_hash_list:
|
||||
if self.args.no_dupes:
|
||||
logger.info(
|
||||
f'Resource hash {resource_hash} from submission {submission.id} downloaded elsewhere')
|
||||
return
|
||||
elif self.args.make_hard_links:
|
||||
self.master_hash_list[resource_hash].link_to(destination)
|
||||
logger.info(
|
||||
f'Hard link made linking {destination} to {self.master_hash_list[resource_hash]}')
|
||||
return
|
||||
elif self.args.make_hard_links:
|
||||
self.master_hash_list[resource_hash].link_to(destination)
|
||||
logger.info(
|
||||
f'Hard link made linking {destination} to {self.master_hash_list[resource_hash]}'
|
||||
f' in submission {submission.id}')
|
||||
return
|
||||
try:
|
||||
with open(destination, 'wb') as file:
|
||||
file.write(res.content)
|
||||
logger.debug(f'Written file to {destination}')
|
||||
self.master_hash_list[resource_hash] = destination
|
||||
logger.debug(f'Hash added to master list: {resource_hash}')
|
||||
logger.info(f'Downloaded submission {submission.id} from {submission.subreddit.display_name}')
|
||||
except OSError as e:
|
||||
logger.exception(e)
|
||||
logger.error(f'Failed to write file in submission {submission.id} to {destination}: {e}')
|
||||
return
|
||||
creation_time = time.mktime(datetime.fromtimestamp(submission.created_utc).timetuple())
|
||||
os.utime(destination, (creation_time, creation_time))
|
||||
self.master_hash_list[resource_hash] = destination
|
||||
logger.debug(f'Hash added to master list: {resource_hash}')
|
||||
logger.info(f'Downloaded submission {submission.id} from {submission.subreddit.display_name}')
|
||||
|
||||
@staticmethod
|
||||
def scan_existing_files(directory: Path) -> dict[str, Path]:
|
||||
@@ -457,27 +133,3 @@ class RedditDownloader:
|
||||
|
||||
hash_list = {res[1]: res[0] for res in results}
|
||||
return hash_list
|
||||
|
||||
def _read_excluded_ids(self) -> set[str]:
|
||||
out = []
|
||||
out.extend(self.args.exclude_id)
|
||||
for id_file in self.args.exclude_id_file:
|
||||
id_file = Path(id_file).resolve().expanduser()
|
||||
if not id_file.exists():
|
||||
logger.warning(f'ID exclusion file at {id_file} does not exist')
|
||||
continue
|
||||
with open(id_file, 'r') as file:
|
||||
for line in file:
|
||||
out.append(line.strip())
|
||||
return set(out)
|
||||
|
||||
@staticmethod
|
||||
def _check_subreddit_status(subreddit: praw.models.Subreddit):
|
||||
if subreddit.display_name == 'all':
|
||||
return
|
||||
try:
|
||||
assert subreddit.id
|
||||
except prawcore.NotFound:
|
||||
raise errors.BulkDownloaderException(f'Source {subreddit.display_name} does not exist or cannot be found')
|
||||
except prawcore.Forbidden:
|
||||
raise errors.BulkDownloaderException(f'Source {subreddit.display_name} is private and cannot be scraped')
|
||||
|
||||
@@ -4,6 +4,7 @@ import datetime
|
||||
import logging
|
||||
import platform
|
||||
import re
|
||||
import subprocess
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
@@ -104,32 +105,54 @@ class FileNameFormatter:
|
||||
) -> Path:
|
||||
subfolder = Path(
|
||||
destination_directory,
|
||||
*[self._format_name(resource.source_submission, part) for part in self.directory_format_string]
|
||||
*[self._format_name(resource.source_submission, part) for part in self.directory_format_string],
|
||||
)
|
||||
index = f'_{str(index)}' if index else ''
|
||||
if not resource.extension:
|
||||
raise BulkDownloaderException(f'Resource from {resource.url} has no extension')
|
||||
ending = index + resource.extension
|
||||
file_name = str(self._format_name(resource.source_submission, self.file_format_string))
|
||||
file_name = self._limit_file_name_length(file_name, ending)
|
||||
if not re.match(r'.*\.$', file_name) and not re.match(r'^\..*', resource.extension):
|
||||
ending = index + '.' + resource.extension
|
||||
else:
|
||||
ending = index + resource.extension
|
||||
|
||||
try:
|
||||
file_path = Path(subfolder, file_name)
|
||||
file_path = self.limit_file_name_length(file_name, ending, subfolder)
|
||||
except TypeError:
|
||||
raise BulkDownloaderException(f'Could not determine path name: {subfolder}, {index}, {resource.extension}')
|
||||
return file_path
|
||||
|
||||
@staticmethod
|
||||
def _limit_file_name_length(filename: str, ending: str) -> str:
|
||||
def limit_file_name_length(filename: str, ending: str, root: Path) -> Path:
|
||||
root = root.resolve().expanduser()
|
||||
possible_id = re.search(r'((?:_\w{6})?$)', filename)
|
||||
if possible_id:
|
||||
ending = possible_id.group(1) + ending
|
||||
filename = filename[:possible_id.start()]
|
||||
max_length_chars = 255 - len(ending)
|
||||
max_length_bytes = 255 - len(ending.encode('utf-8'))
|
||||
while len(filename) > max_length_chars or len(filename.encode('utf-8')) > max_length_bytes:
|
||||
max_path = FileNameFormatter.find_max_path_length()
|
||||
max_file_part_length_chars = 255 - len(ending)
|
||||
max_file_part_length_bytes = 255 - len(ending.encode('utf-8'))
|
||||
max_path_length = max_path - len(ending) - len(str(root)) - 1
|
||||
|
||||
out = Path(root, filename + ending)
|
||||
while any([len(filename) > max_file_part_length_chars,
|
||||
len(filename.encode('utf-8')) > max_file_part_length_bytes,
|
||||
len(str(out)) > max_path_length,
|
||||
]):
|
||||
filename = filename[:-1]
|
||||
return filename + ending
|
||||
out = Path(root, filename + ending)
|
||||
|
||||
return out
|
||||
|
||||
@staticmethod
|
||||
def find_max_path_length() -> int:
|
||||
try:
|
||||
return int(subprocess.check_output(['getconf', 'PATH_MAX', '/']))
|
||||
except (ValueError, subprocess.CalledProcessError, OSError):
|
||||
if platform.system() == 'Windows':
|
||||
return 260
|
||||
else:
|
||||
return 4096
|
||||
|
||||
def format_resource_paths(
|
||||
self,
|
||||
|
||||
@@ -5,8 +5,8 @@ import hashlib
|
||||
import logging
|
||||
import re
|
||||
import time
|
||||
from typing import Optional
|
||||
import urllib.parse
|
||||
from typing import Callable, Optional
|
||||
|
||||
import _hashlib
|
||||
import requests
|
||||
@@ -18,40 +18,26 @@ logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class Resource:
|
||||
def __init__(self, source_submission: Submission, url: str, extension: str = None):
|
||||
def __init__(self, source_submission: Submission, url: str, download_function: Callable, extension: str = None):
|
||||
self.source_submission = source_submission
|
||||
self.content: Optional[bytes] = None
|
||||
self.url = url
|
||||
self.hash: Optional[_hashlib.HASH] = None
|
||||
self.extension = extension
|
||||
self.download_function = download_function
|
||||
if not self.extension:
|
||||
self.extension = self._determine_extension()
|
||||
|
||||
@staticmethod
|
||||
def retry_download(url: str, max_wait_time: int) -> Optional[bytes]:
|
||||
wait_time = 60
|
||||
try:
|
||||
response = requests.get(url)
|
||||
if re.match(r'^2\d{2}', str(response.status_code)) and response.content:
|
||||
return response.content
|
||||
elif response.status_code in (408, 429):
|
||||
raise requests.exceptions.ConnectionError(f'Response code {response.status_code}')
|
||||
else:
|
||||
raise BulkDownloaderException(
|
||||
f'Unrecoverable error requesting resource: HTTP Code {response.status_code}')
|
||||
except requests.exceptions.ConnectionError as e:
|
||||
logger.warning(f'Error occurred downloading from {url}, waiting {wait_time} seconds: {e}')
|
||||
time.sleep(wait_time)
|
||||
if wait_time < max_wait_time:
|
||||
return Resource.retry_download(url, max_wait_time)
|
||||
else:
|
||||
logger.error(f'Max wait time exceeded for resource at url {url}')
|
||||
raise
|
||||
def retry_download(url: str) -> Callable:
|
||||
return lambda global_params: Resource.http_download(url, global_params)
|
||||
|
||||
def download(self, max_wait_time: int):
|
||||
def download(self, download_parameters: Optional[dict] = None):
|
||||
if download_parameters is None:
|
||||
download_parameters = {}
|
||||
if not self.content:
|
||||
try:
|
||||
content = self.retry_download(self.url, max_wait_time)
|
||||
content = self.download_function(download_parameters)
|
||||
except requests.exceptions.ConnectionError as e:
|
||||
raise BulkDownloaderException(f'Could not download resource: {e}')
|
||||
except BulkDownloaderException:
|
||||
@@ -70,3 +56,30 @@ class Resource:
|
||||
match = re.search(extension_pattern, stripped_url)
|
||||
if match:
|
||||
return match.group(1)
|
||||
|
||||
@staticmethod
|
||||
def http_download(url: str, download_parameters: dict) -> Optional[bytes]:
|
||||
headers = download_parameters.get('headers')
|
||||
current_wait_time = 60
|
||||
if 'max_wait_time' in download_parameters:
|
||||
max_wait_time = download_parameters['max_wait_time']
|
||||
else:
|
||||
max_wait_time = 300
|
||||
while True:
|
||||
try:
|
||||
response = requests.get(url, headers=headers)
|
||||
if re.match(r'^2\d{2}', str(response.status_code)) and response.content:
|
||||
return response.content
|
||||
elif response.status_code in (408, 429):
|
||||
raise requests.exceptions.ConnectionError(f'Response code {response.status_code}')
|
||||
else:
|
||||
raise BulkDownloaderException(
|
||||
f'Unrecoverable error requesting resource: HTTP Code {response.status_code}')
|
||||
except (requests.exceptions.ConnectionError, requests.exceptions.ChunkedEncodingError) as e:
|
||||
logger.warning(f'Error occurred downloading from {url}, waiting {current_wait_time} seconds: {e}')
|
||||
time.sleep(current_wait_time)
|
||||
if current_wait_time < max_wait_time:
|
||||
current_wait_time += 60
|
||||
else:
|
||||
logger.error(f'Max wait time exceeded for resource at url {url}')
|
||||
raise
|
||||
|
||||
@@ -14,4 +14,4 @@ class Direct(BaseDownloader):
|
||||
super().__init__(post)
|
||||
|
||||
def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]:
|
||||
return [Resource(self.post, self.post.url)]
|
||||
return [Resource(self.post, self.post.url, Resource.retry_download(self.post.url))]
|
||||
|
||||
@@ -9,27 +9,32 @@ from bdfr.exceptions import NotADownloadableLinkError
|
||||
from bdfr.site_downloaders.base_downloader import BaseDownloader
|
||||
from bdfr.site_downloaders.direct import Direct
|
||||
from bdfr.site_downloaders.erome import Erome
|
||||
from bdfr.site_downloaders.fallback_downloaders.youtubedl_fallback import YoutubeDlFallback
|
||||
from bdfr.site_downloaders.fallback_downloaders.ytdlp_fallback import YtdlpFallback
|
||||
from bdfr.site_downloaders.gallery import Gallery
|
||||
from bdfr.site_downloaders.gfycat import Gfycat
|
||||
from bdfr.site_downloaders.imgur import Imgur
|
||||
from bdfr.site_downloaders.pornhub import PornHub
|
||||
from bdfr.site_downloaders.redgifs import Redgifs
|
||||
from bdfr.site_downloaders.self_post import SelfPost
|
||||
from bdfr.site_downloaders.vidble import Vidble
|
||||
from bdfr.site_downloaders.youtube import Youtube
|
||||
|
||||
|
||||
class DownloadFactory:
|
||||
@staticmethod
|
||||
def pull_lever(url: str) -> Type[BaseDownloader]:
|
||||
sanitised_url = DownloadFactory._sanitise_url(url)
|
||||
if re.match(r'(i\.)?imgur.*\.gifv$', sanitised_url):
|
||||
sanitised_url = DownloadFactory.sanitise_url(url)
|
||||
if re.match(r'(i\.)?imgur.*\.gif.+$', sanitised_url):
|
||||
return Imgur
|
||||
elif re.match(r'.*/.*\.\w{3,4}(\?[\w;&=]*)?$', sanitised_url):
|
||||
elif re.match(r'.*/.*\.\w{3,4}(\?[\w;&=]*)?$', sanitised_url) and \
|
||||
not DownloadFactory.is_web_resource(sanitised_url):
|
||||
return Direct
|
||||
elif re.match(r'erome\.com.*', sanitised_url):
|
||||
return Erome
|
||||
elif re.match(r'reddit\.com/gallery/.*', sanitised_url):
|
||||
return Gallery
|
||||
elif re.match(r'patreon\.com.*', sanitised_url):
|
||||
return Gallery
|
||||
elif re.match(r'gfycat\.', sanitised_url):
|
||||
return Gfycat
|
||||
elif re.match(r'(m\.)?imgur.*', sanitised_url):
|
||||
@@ -42,16 +47,39 @@ class DownloadFactory:
|
||||
return Youtube
|
||||
elif re.match(r'i\.redd\.it.*', sanitised_url):
|
||||
return Direct
|
||||
elif YoutubeDlFallback.can_handle_link(sanitised_url):
|
||||
return YoutubeDlFallback
|
||||
elif re.match(r'pornhub\.com.*', sanitised_url):
|
||||
return PornHub
|
||||
elif re.match(r'vidble\.com', sanitised_url):
|
||||
return Vidble
|
||||
elif YtdlpFallback.can_handle_link(sanitised_url):
|
||||
return YtdlpFallback
|
||||
else:
|
||||
raise NotADownloadableLinkError(
|
||||
f'No downloader module exists for url {url}')
|
||||
raise NotADownloadableLinkError(f'No downloader module exists for url {url}')
|
||||
|
||||
@staticmethod
|
||||
def _sanitise_url(url: str) -> str:
|
||||
def sanitise_url(url: str) -> str:
|
||||
beginning_regex = re.compile(r'\s*(www\.?)?')
|
||||
split_url = urllib.parse.urlsplit(url)
|
||||
split_url = split_url.netloc + split_url.path
|
||||
split_url = re.sub(beginning_regex, '', split_url)
|
||||
return split_url
|
||||
|
||||
@staticmethod
|
||||
def is_web_resource(url: str) -> bool:
|
||||
web_extensions = (
|
||||
'asp',
|
||||
'aspx',
|
||||
'cfm',
|
||||
'cfml',
|
||||
'css',
|
||||
'htm',
|
||||
'html',
|
||||
'js',
|
||||
'php',
|
||||
'php3',
|
||||
'xhtml',
|
||||
)
|
||||
if re.match(rf'(?i).*/.*\.({"|".join(web_extensions)})$', url):
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
|
||||
import logging
|
||||
import re
|
||||
from typing import Optional
|
||||
from typing import Callable, Optional
|
||||
|
||||
import bs4
|
||||
from praw.models import Submission
|
||||
@@ -29,7 +29,7 @@ class Erome(BaseDownloader):
|
||||
for link in links:
|
||||
if not re.match(r'https?://.*', link):
|
||||
link = 'https://' + link
|
||||
out.append(Resource(self.post, link))
|
||||
out.append(Resource(self.post, link, self.erome_download(link)))
|
||||
return out
|
||||
|
||||
@staticmethod
|
||||
@@ -43,3 +43,14 @@ class Erome(BaseDownloader):
|
||||
out.extend([vid.get('src') for vid in videos])
|
||||
|
||||
return set(out)
|
||||
|
||||
@staticmethod
|
||||
def erome_download(url: str) -> Callable:
|
||||
download_parameters = {
|
||||
'headers': {
|
||||
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
|
||||
' Chrome/88.0.4324.104 Safari/537.36',
|
||||
'Referer': 'https://www.erome.com/',
|
||||
},
|
||||
}
|
||||
return lambda global_params: Resource.http_download(url, global_params | download_parameters)
|
||||
|
||||
@@ -4,9 +4,9 @@
|
||||
import logging
|
||||
from typing import Optional
|
||||
|
||||
import youtube_dl
|
||||
from praw.models import Submission
|
||||
|
||||
from bdfr.exceptions import NotADownloadableLinkError
|
||||
from bdfr.resource import Resource
|
||||
from bdfr.site_authenticator import SiteAuthenticator
|
||||
from bdfr.site_downloaders.fallback_downloaders.fallback_downloader import BaseFallbackDownloader
|
||||
@@ -15,26 +15,24 @@ from bdfr.site_downloaders.youtube import Youtube
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class YoutubeDlFallback(BaseFallbackDownloader, Youtube):
|
||||
class YtdlpFallback(BaseFallbackDownloader, Youtube):
|
||||
def __init__(self, post: Submission):
|
||||
super(YoutubeDlFallback, self).__init__(post)
|
||||
super(YtdlpFallback, self).__init__(post)
|
||||
|
||||
def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]:
|
||||
out = super()._download_video({})
|
||||
out = Resource(
|
||||
self.post,
|
||||
self.post.url,
|
||||
super()._download_video({}),
|
||||
super().get_video_attributes(self.post.url)['ext'],
|
||||
)
|
||||
return [out]
|
||||
|
||||
@staticmethod
|
||||
def can_handle_link(url: str) -> bool:
|
||||
yt_logger = logging.getLogger('youtube-dl')
|
||||
yt_logger.setLevel(logging.CRITICAL)
|
||||
with youtube_dl.YoutubeDL({
|
||||
'logger': yt_logger,
|
||||
}) as ydl:
|
||||
try:
|
||||
result = ydl.extract_info(url, download=False)
|
||||
if result:
|
||||
return True
|
||||
except Exception as e:
|
||||
logger.exception(e)
|
||||
return False
|
||||
return False
|
||||
try:
|
||||
attributes = YtdlpFallback.get_video_attributes(url)
|
||||
except NotADownloadableLinkError:
|
||||
return False
|
||||
if attributes:
|
||||
return True
|
||||
@@ -1,10 +1,9 @@
|
||||
#!/usr/bin/env python3
|
||||
|
||||
import logging
|
||||
import re
|
||||
from typing import Optional
|
||||
|
||||
import bs4
|
||||
import requests
|
||||
from praw.models import Submission
|
||||
|
||||
from bdfr.exceptions import SiteDownloaderError
|
||||
@@ -20,21 +19,30 @@ class Gallery(BaseDownloader):
|
||||
super().__init__(post)
|
||||
|
||||
def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]:
|
||||
image_urls = self._get_links(self.post.url)
|
||||
try:
|
||||
image_urls = self._get_links(self.post.gallery_data['items'])
|
||||
except (AttributeError, TypeError):
|
||||
try:
|
||||
image_urls = self._get_links(self.post.crosspost_parent_list[0]['gallery_data']['items'])
|
||||
except (AttributeError, IndexError, TypeError, KeyError):
|
||||
logger.error(f'Could not find gallery data in submission {self.post.id}')
|
||||
logger.exception('Gallery image find failure')
|
||||
raise SiteDownloaderError('No images found in Reddit gallery')
|
||||
|
||||
if not image_urls:
|
||||
raise SiteDownloaderError('No images found in Reddit gallery')
|
||||
return [Resource(self.post, url) for url in image_urls]
|
||||
return [Resource(self.post, url, Resource.retry_download(url)) for url in image_urls]
|
||||
|
||||
@staticmethod
|
||||
def _get_links(url: str) -> list[str]:
|
||||
resource_headers = {
|
||||
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
|
||||
' Chrome/67.0.3396.87 Safari/537.36 OPR/54.0.2952.64',
|
||||
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
|
||||
}
|
||||
page = Gallery.retrieve_url(url, headers=resource_headers)
|
||||
soup = bs4.BeautifulSoup(page.text, 'html.parser')
|
||||
|
||||
links = soup.findAll('a', attrs={'target': '_blank', 'href': re.compile(r'https://preview\.redd\.it.*')})
|
||||
links = [link.get('href') for link in links]
|
||||
return links
|
||||
@staticmethod
|
||||
def _get_links(id_dict: list[dict]) -> list[str]:
|
||||
out = []
|
||||
for item in id_dict:
|
||||
image_id = item['media_id']
|
||||
possible_extensions = ('.jpg', '.png', '.gif', '.gifv', '.jpeg')
|
||||
for extension in possible_extensions:
|
||||
test_url = f'https://i.redd.it/{image_id}{extension}'
|
||||
response = requests.head(test_url)
|
||||
if response.status_code == 200:
|
||||
out.append(test_url)
|
||||
break
|
||||
return out
|
||||
|
||||
@@ -27,6 +27,7 @@ class Gfycat(Redgifs):
|
||||
|
||||
response = Gfycat.retrieve_url(url)
|
||||
if re.search(r'(redgifs|gifdeliverynetwork)', response.url):
|
||||
url = url.lower() # Fixes error with old gfycat/redgifs links
|
||||
return Redgifs._get_link(url)
|
||||
|
||||
soup = BeautifulSoup(response.text, 'html.parser')
|
||||
|
||||
@@ -32,14 +32,19 @@ class Imgur(BaseDownloader):
|
||||
return out
|
||||
|
||||
def _compute_image_url(self, image: dict) -> Resource:
|
||||
image_url = 'https://i.imgur.com/' + image['hash'] + self._validate_extension(image['ext'])
|
||||
return Resource(self.post, image_url)
|
||||
ext = self._validate_extension(image['ext'])
|
||||
if image.get('prefer_video', False):
|
||||
ext = '.mp4'
|
||||
|
||||
image_url = 'https://i.imgur.com/' + image['hash'] + ext
|
||||
return Resource(self.post, image_url, Resource.retry_download(image_url))
|
||||
|
||||
@staticmethod
|
||||
def _get_data(link: str) -> dict:
|
||||
if re.match(r'.*\.gifv$', link):
|
||||
link = link.rstrip('?')
|
||||
if re.match(r'(?i).*\.gif.+$', link):
|
||||
link = link.replace('i.imgur', 'imgur')
|
||||
link = link.rstrip('.gifv')
|
||||
link = re.sub('(?i)\\.gif.+$', '', link)
|
||||
|
||||
res = Imgur.retrieve_url(link, cookies={'over18': '1', 'postpagebeta': '0'})
|
||||
|
||||
@@ -71,6 +76,7 @@ class Imgur(BaseDownloader):
|
||||
|
||||
@staticmethod
|
||||
def _validate_extension(extension_suffix: str) -> str:
|
||||
extension_suffix = extension_suffix.strip('?1')
|
||||
possible_extensions = ('.jpg', '.png', '.mp4', '.gif')
|
||||
selection = [ext for ext in possible_extensions if ext == extension_suffix]
|
||||
if len(selection) == 1:
|
||||
|
||||
37
bdfr/site_downloaders/pornhub.py
Normal file
@@ -0,0 +1,37 @@
|
||||
#!/usr/bin/env python3
|
||||
# coding=utf-8
|
||||
|
||||
import logging
|
||||
from typing import Optional
|
||||
|
||||
from praw.models import Submission
|
||||
|
||||
from bdfr.exceptions import SiteDownloaderError
|
||||
from bdfr.resource import Resource
|
||||
from bdfr.site_authenticator import SiteAuthenticator
|
||||
from bdfr.site_downloaders.youtube import Youtube
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class PornHub(Youtube):
|
||||
def __init__(self, post: Submission):
|
||||
super().__init__(post)
|
||||
|
||||
def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]:
|
||||
ytdl_options = {
|
||||
'format': 'best',
|
||||
'nooverwrites': True,
|
||||
}
|
||||
if video_attributes := super().get_video_attributes(self.post.url):
|
||||
extension = video_attributes['ext']
|
||||
else:
|
||||
raise SiteDownloaderError()
|
||||
|
||||
out = Resource(
|
||||
self.post,
|
||||
self.post.url,
|
||||
super()._download_video(ytdl_options),
|
||||
extension,
|
||||
)
|
||||
return [out]
|
||||
@@ -4,7 +4,6 @@ import json
|
||||
import re
|
||||
from typing import Optional
|
||||
|
||||
from bs4 import BeautifulSoup
|
||||
from praw.models import Submission
|
||||
|
||||
from bdfr.exceptions import SiteDownloaderError
|
||||
@@ -19,7 +18,7 @@ class Redgifs(BaseDownloader):
|
||||
|
||||
def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]:
|
||||
media_url = self._get_link(self.post.url)
|
||||
return [Resource(self.post, media_url, '.mp4')]
|
||||
return [Resource(self.post, media_url, Resource.retry_download(media_url), '.mp4')]
|
||||
|
||||
@staticmethod
|
||||
def _get_link(url: str) -> str:
|
||||
|
||||
@@ -17,7 +17,7 @@ class SelfPost(BaseDownloader):
|
||||
super().__init__(post)
|
||||
|
||||
def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]:
|
||||
out = Resource(self.post, self.post.url, '.txt')
|
||||
out = Resource(self.post, self.post.url, lambda: None, '.txt')
|
||||
out.content = self.export_to_string().encode('utf-8')
|
||||
out.create_hash()
|
||||
return [out]
|
||||
|
||||
54
bdfr/site_downloaders/vidble.py
Normal file
@@ -0,0 +1,54 @@
|
||||
#!/usr/bin/env python3
|
||||
# coding=utf-8
|
||||
import itertools
|
||||
import logging
|
||||
import re
|
||||
from typing import Optional
|
||||
|
||||
import bs4
|
||||
import requests
|
||||
from praw.models import Submission
|
||||
|
||||
from bdfr.exceptions import SiteDownloaderError
|
||||
from bdfr.resource import Resource
|
||||
from bdfr.site_authenticator import SiteAuthenticator
|
||||
from bdfr.site_downloaders.base_downloader import BaseDownloader
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
class Vidble(BaseDownloader):
|
||||
def __init__(self, post: Submission):
|
||||
super().__init__(post)
|
||||
|
||||
def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]:
|
||||
try:
|
||||
res = self.get_links(self.post.url)
|
||||
except AttributeError:
|
||||
raise SiteDownloaderError(f'Could not read page at {self.post.url}')
|
||||
if not res:
|
||||
raise SiteDownloaderError(rf'No resources found at {self.post.url}')
|
||||
res = [Resource(self.post, r, Resource.retry_download(r)) for r in res]
|
||||
return res
|
||||
|
||||
@staticmethod
|
||||
def get_links(url: str) -> set[str]:
|
||||
if not re.search(r'vidble\.com/(show/|album/|watch\?v)', url):
|
||||
url = re.sub(r'/(\w*?)$', r'/show/\1', url)
|
||||
|
||||
page = requests.get(url)
|
||||
soup = bs4.BeautifulSoup(page.text, 'html.parser')
|
||||
content_div = soup.find('div', attrs={'id': 'ContentPlaceHolder1_divContent'})
|
||||
images = content_div.find_all('img')
|
||||
images = [i.get('src') for i in images]
|
||||
videos = content_div.find_all('source', attrs={'type': 'video/mp4'})
|
||||
videos = [v.get('src') for v in videos]
|
||||
resources = filter(None, itertools.chain(images, videos))
|
||||
resources = ['https://www.vidble.com' + r for r in resources]
|
||||
resources = [Vidble.change_med_url(r) for r in resources]
|
||||
return set(resources)
|
||||
|
||||
@staticmethod
|
||||
def change_med_url(url: str) -> str:
|
||||
out = re.sub(r'_med(\..{3,4})$', r'\1', url)
|
||||
return out
|
||||
@@ -3,12 +3,12 @@
|
||||
import logging
|
||||
import tempfile
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
from typing import Callable, Optional
|
||||
|
||||
import youtube_dl
|
||||
import yt_dlp
|
||||
from praw.models import Submission
|
||||
|
||||
from bdfr.exceptions import SiteDownloaderError
|
||||
from bdfr.exceptions import NotADownloadableLinkError, SiteDownloaderError
|
||||
from bdfr.resource import Resource
|
||||
from bdfr.site_authenticator import SiteAuthenticator
|
||||
from bdfr.site_downloaders.base_downloader import BaseDownloader
|
||||
@@ -26,28 +26,48 @@ class Youtube(BaseDownloader):
|
||||
'playlistend': 1,
|
||||
'nooverwrites': True,
|
||||
}
|
||||
out = self._download_video(ytdl_options)
|
||||
return [out]
|
||||
download_function = self._download_video(ytdl_options)
|
||||
extension = self.get_video_attributes(self.post.url)['ext']
|
||||
res = Resource(self.post, self.post.url, download_function, extension)
|
||||
return [res]
|
||||
|
||||
def _download_video(self, ytdl_options: dict) -> Resource:
|
||||
def _download_video(self, ytdl_options: dict) -> Callable:
|
||||
yt_logger = logging.getLogger('youtube-dl')
|
||||
yt_logger.setLevel(logging.CRITICAL)
|
||||
ytdl_options['quiet'] = True
|
||||
ytdl_options['logger'] = yt_logger
|
||||
with tempfile.TemporaryDirectory() as temp_dir:
|
||||
download_path = Path(temp_dir).resolve()
|
||||
ytdl_options['outtmpl'] = str(download_path) + '/' + 'test.%(ext)s'
|
||||
try:
|
||||
with youtube_dl.YoutubeDL(ytdl_options) as ydl:
|
||||
ydl.download([self.post.url])
|
||||
except youtube_dl.DownloadError as e:
|
||||
raise SiteDownloaderError(f'Youtube download failed: {e}')
|
||||
|
||||
downloaded_file = list(download_path.iterdir())[0]
|
||||
extension = downloaded_file.suffix
|
||||
with open(downloaded_file, 'rb') as file:
|
||||
content = file.read()
|
||||
out = Resource(self.post, self.post.url, extension)
|
||||
out.content = content
|
||||
out.create_hash()
|
||||
return out
|
||||
def download(_: dict) -> bytes:
|
||||
with tempfile.TemporaryDirectory() as temp_dir:
|
||||
download_path = Path(temp_dir).resolve()
|
||||
ytdl_options['outtmpl'] = str(download_path) + '/' + 'test.%(ext)s'
|
||||
try:
|
||||
with yt_dlp.YoutubeDL(ytdl_options) as ydl:
|
||||
ydl.download([self.post.url])
|
||||
except yt_dlp.DownloadError as e:
|
||||
raise SiteDownloaderError(f'Youtube download failed: {e}')
|
||||
|
||||
downloaded_files = list(download_path.iterdir())
|
||||
if len(downloaded_files) > 0:
|
||||
downloaded_file = downloaded_files[0]
|
||||
else:
|
||||
raise NotADownloadableLinkError(f"No media exists in the URL {self.post.url}")
|
||||
with open(downloaded_file, 'rb') as file:
|
||||
content = file.read()
|
||||
return content
|
||||
return download
|
||||
|
||||
@staticmethod
|
||||
def get_video_attributes(url: str) -> dict:
|
||||
yt_logger = logging.getLogger('youtube-dl')
|
||||
yt_logger.setLevel(logging.CRITICAL)
|
||||
with yt_dlp.YoutubeDL({'logger': yt_logger, }) as ydl:
|
||||
try:
|
||||
result = ydl.extract_info(url, download=False)
|
||||
except Exception as e:
|
||||
logger.exception(e)
|
||||
raise NotADownloadableLinkError(f'Video info extraction failed for {url}')
|
||||
if 'ext' in result:
|
||||
return result
|
||||
else:
|
||||
raise NotADownloadableLinkError(f'Video info extraction failed for {url}')
|
||||
|
||||
@@ -1,2 +1,5 @@
|
||||
copy .\\bdfr\\default_config.cfg .\\test_config.cfg
|
||||
echo "`nuser_token = $env:REDDIT_TOKEN" >> ./test_config.cfg
|
||||
if (-not ([string]::IsNullOrEmpty($env:REDDIT_TOKEN)))
|
||||
{
|
||||
copy .\\bdfr\\default_config.cfg .\\test_config.cfg
|
||||
echo "`nuser_token = $env:REDDIT_TOKEN" >> ./test_config.cfg
|
||||
}
|
||||
@@ -1,2 +1,5 @@
|
||||
cp ./bdfr/default_config.cfg ./test_config.cfg
|
||||
echo -e "\nuser_token = $REDDIT_TOKEN" >> ./test_config.cfg
|
||||
if [ ! -z "$REDDIT_TOKEN" ]
|
||||
then
|
||||
cp ./bdfr/default_config.cfg ./test_config.cfg
|
||||
echo -e "\nuser_token = $REDDIT_TOKEN" >> ./test_config.cfg
|
||||
fi
|
||||
@@ -6,4 +6,4 @@ ffmpeg-python>=0.2.0
|
||||
praw>=7.2.0
|
||||
pyyaml>=5.4.1
|
||||
requests>=2.25.1
|
||||
youtube-dl>=2021.3.14
|
||||
yt-dlp>=2021.9.25
|
||||
21
scripts/extract_failed_ids.ps1
Normal file
@@ -0,0 +1,21 @@
|
||||
if (Test-Path -Path $args[0] -PathType Leaf) {
|
||||
$file=$args[0]
|
||||
}
|
||||
else {
|
||||
Write-Host "CANNOT FIND LOG FILE"
|
||||
Exit 1
|
||||
}
|
||||
|
||||
if ($args[1] -ne $null) {
|
||||
$output=$args[1]
|
||||
Write-Host "Outputting IDs to $output"
|
||||
}
|
||||
else {
|
||||
$output="./failed.txt"
|
||||
}
|
||||
|
||||
Select-String -Path $file -Pattern "Could not download submission" | ForEach-Object { -split $_.Line | Select-Object -Skip 11 | Select-Object -First 1 } | foreach { $_.substring(0,$_.Length-1) } >> $output
|
||||
Select-String -Path $file -Pattern "Failed to download resource" | ForEach-Object { -split $_.Line | Select-Object -Skip 14 | Select-Object -First 1 } >> $output
|
||||
Select-String -Path $file -Pattern "failed to download submission" | ForEach-Object { -split $_.Line | Select-Object -Skip 13 | Select-Object -First 1 } | foreach { $_.substring(0,$_.Length-1) } >> $output
|
||||
Select-String -Path $file -Pattern "Failed to write file" | ForEach-Object { -split $_.Line | Select-Object -Skip 13 | Select-Object -First 1 } >> $output
|
||||
Select-String -Path $file -Pattern "skipped due to disabled module" | ForEach-Object { -split $_.Line | Select-Object -Skip 8 | Select-Object -First 1 } >> $output
|
||||
@@ -11,8 +11,13 @@ if [ -n "$2" ]; then
|
||||
output="$2"
|
||||
echo "Outputting IDs to $output"
|
||||
else
|
||||
output="failed.txt"
|
||||
output="./failed.txt"
|
||||
fi
|
||||
|
||||
grep 'Could not download submission' "$file" | awk '{ print $12 }' | rev | cut -c 2- | rev >>"$output"
|
||||
grep 'Failed to download resource' "$file" | awk '{ print $15 }' >>"$output"
|
||||
{
|
||||
grep 'Could not download submission' "$file" | awk '{ print $12 }' | rev | cut -c 2- | rev ;
|
||||
grep 'Failed to download resource' "$file" | awk '{ print $15 }' ;
|
||||
grep 'failed to download submission' "$file" | awk '{ print $14 }' | rev | cut -c 2- | rev ;
|
||||
grep 'Failed to write file' "$file" | awk '{ print $14 }' ;
|
||||
grep 'skipped due to disabled module' "$file" | awk '{ print $9 }' ;
|
||||
} >>"$output"
|
||||
|
||||
21
scripts/extract_successful_ids.ps1
Normal file
@@ -0,0 +1,21 @@
|
||||
if (Test-Path -Path $args[0] -PathType Leaf) {
|
||||
$file=$args[0]
|
||||
}
|
||||
else {
|
||||
Write-Host "CANNOT FIND LOG FILE"
|
||||
Exit 1
|
||||
}
|
||||
|
||||
if ($args[1] -ne $null) {
|
||||
$output=$args[1]
|
||||
Write-Host "Outputting IDs to $output"
|
||||
}
|
||||
else {
|
||||
$output="./successful.txt"
|
||||
}
|
||||
|
||||
Select-String -Path $file -Pattern "Downloaded submission" | ForEach-Object { -split $_.Line | Select-Object -Last 3 | Select-Object -SkipLast 2 } >> $output
|
||||
Select-String -Path $file -Pattern "Resource hash" | ForEach-Object { -split $_.Line | Select-Object -Last 3 | Select-Object -SkipLast 2 } >> $output
|
||||
Select-String -Path $file -Pattern "Download filter" | ForEach-Object { -split $_.Line | Select-Object -Last 4 | Select-Object -SkipLast 3 } >> $output
|
||||
Select-String -Path $file -Pattern "already exists, continuing" | ForEach-Object { -split $_.Line | Select-Object -Last 4 | Select-Object -SkipLast 3 } >> $output
|
||||
Select-String -Path $file -Pattern "Hard link made" | ForEach-Object { -split $_.Line | Select-Object -Last 1 } >> $output
|
||||
@@ -11,7 +11,13 @@ if [ -n "$2" ]; then
|
||||
output="$2"
|
||||
echo "Outputting IDs to $output"
|
||||
else
|
||||
output="successful.txt"
|
||||
output="./successful.txt"
|
||||
fi
|
||||
|
||||
grep 'Downloaded submission' "$file" | awk '{ print $(NF-2) }' >> "$output"
|
||||
{
|
||||
grep 'Downloaded submission' "$file" | awk '{ print $(NF-2) }' ;
|
||||
grep 'Resource hash' "$file" | awk '{ print $(NF-2) }' ;
|
||||
grep 'Download filter' "$file" | awk '{ print $(NF-3) }' ;
|
||||
grep 'already exists, continuing' "$file" | awk '{ print $(NF-3) }' ;
|
||||
grep 'Hard link made' "$file" | awk '{ print $(NF) }' ;
|
||||
} >> "$output"
|
||||
|
||||
30
scripts/print_summary.ps1
Normal file
@@ -0,0 +1,30 @@
|
||||
if (Test-Path -Path $args[0] -PathType Leaf) {
|
||||
$file=$args[0]
|
||||
}
|
||||
else {
|
||||
Write-Host "CANNOT FIND LOG FILE"
|
||||
Exit 1
|
||||
}
|
||||
|
||||
if ($args[1] -ne $null) {
|
||||
$output=$args[1]
|
||||
Write-Host "Outputting IDs to $output"
|
||||
}
|
||||
else {
|
||||
$output="./successful.txt"
|
||||
}
|
||||
|
||||
Write-Host -NoNewline "Downloaded submissions: "
|
||||
Write-Host (Select-String -Path $file -Pattern "Downloaded submission" -AllMatches).Matches.Count
|
||||
Write-Host -NoNewline "Failed downloads: "
|
||||
Write-Host (Select-String -Path $file -Pattern "failed to download submission" -AllMatches).Matches.Count
|
||||
Write-Host -NoNewline "Files already downloaded: "
|
||||
Write-Host (Select-String -Path $file -Pattern "already exists, continuing" -AllMatches).Matches.Count
|
||||
Write-Host -NoNewline "Hard linked submissions: "
|
||||
Write-Host (Select-String -Path $file -Pattern "Hard link made" -AllMatches).Matches.Count
|
||||
Write-Host -NoNewline "Excluded submissions: "
|
||||
Write-Host (Select-String -Path $file -Pattern "in exclusion list" -AllMatches).Matches.Count
|
||||
Write-Host -NoNewline "Files with existing hash skipped: "
|
||||
Write-Host (Select-String -Path $file -Pattern "downloaded elsewhere" -AllMatches).Matches.Count
|
||||
Write-Host -NoNewline "Submissions from excluded subreddits: "
|
||||
Write-Host (Select-String -Path $file -Pattern "in skip list" -AllMatches).Matches.Count
|
||||
13
scripts/tests/README.md
Normal file
@@ -0,0 +1,13 @@
|
||||
# Bash Scripts Testing
|
||||
|
||||
The `bats` framework is included and used to test the bundled scripts, specifically those designed to parse the logging output. As this involves delicate regexes and field indexes, it is necessary to test them.
|
||||
|
||||
## Running Tests
|
||||
|
||||
Running the tests is easy and can be done with a single command. Once the working directory is this directory, run the following command.
|
||||
|
||||
```bash
|
||||
./bats/bin/bats *.bats
|
||||
```
|
||||
|
||||
This will run all test files that have the `.bats` suffix.
|
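For reference, a minimal sketch of what a test file looks like is shown below, following the same pattern as the tests already in this directory: load the helper libraries in `setup()`, invoke the script under test with `run`, then assert on the output file. The logfile path `./example_logfiles/my_example.txt` is purely illustrative and is not shipped with the repository.

```bash
setup() {
    # bats-support and bats-assert are included as git submodules
    load ./test_helper/bats-support/load
    load ./test_helper/bats-assert/load
}

teardown() {
    # the extraction scripts append to successful.txt by default
    rm -f successful.txt
}

@test "example: extracts a single downloaded submission ID" {
    # hypothetical logfile containing one "Downloaded submission ..." line
    run ../extract_successful_ids.sh ./example_logfiles/my_example.txt
    assert [ "$( wc -l 'successful.txt' | awk '{ print $1 }' )" -eq "1" ];
}
```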
||||
1
scripts/tests/bats
Submodule
Submodule scripts/tests/bats added at ce5ca2802f
@@ -0,0 +1 @@
|
||||
[2021-06-12 12:49:18,452 - bdfr.downloader - DEBUG] - Submission m2601g skipped due to disabled module Direct
|
||||
3
scripts/tests/example_logfiles/failed_no_downloader.txt
Normal file
@@ -0,0 +1,3 @@
|
||||
[2021-06-12 11:13:35,665 - bdfr.downloader - ERROR] - Could not download submission nxv3ew: No downloader module exists for url https://www.biorxiv.org/content/10.1101/2021.06.11.447961v1?rss=1
|
||||
[2021-06-12 11:14:21,958 - bdfr.downloader - ERROR] - Could not download submission nxv3ek: No downloader module exists for url https://alkossegyedit.hu/termek/pluss-macko-poloval-20cm/?feed_id=34832&_unique_id=60c40a1190ccb&utm_source=Reddit&utm_medium=AEAdmin&utm_campaign=Poster
|
||||
[2021-06-12 11:17:53,456 - bdfr.downloader - ERROR] - Could not download submission nxv3ea: No downloader module exists for url https://www.biorxiv.org/content/10.1101/2021.06.11.448067v1?rss=1
|
||||
2
scripts/tests/example_logfiles/failed_resource_error.txt
Normal file
@@ -0,0 +1,2 @@
|
||||
[2021-06-12 11:18:25,794 - bdfr.downloader - ERROR] - Failed to download resource https://i.redd.it/61fniokpjq471.jpg in submission nxv3dt with downloader Direct: Unrecoverable error requesting resource: HTTP Code 404
|
||||
|
||||
@@ -0,0 +1,2 @@
|
||||
[2021-06-12 08:38:35,657 - bdfr.downloader - ERROR] - Site Gallery failed to download submission nxr7x9: No images found in Reddit gallery
|
||||
[2021-06-12 08:47:22,005 - bdfr.downloader - ERROR] - Site Gallery failed to download submission nxpn0h: Server responded with 503 to https://www.reddit.com/gallery/nxpkvh
|
||||
1
scripts/tests/example_logfiles/failed_write_error.txt
Normal file
@@ -0,0 +1 @@
|
||||
[2021-06-09 22:01:04,530 - bdfr.downloader - ERROR] - Failed to write file in submission nnboza to C:\Users\Yoga 14\path\to\output\ThotNetwork\KatieCarmine_I POST A NEW VIDEO ALMOST EVERYDAY AND YOU NEVER HAVE TO PAY EXTRA FOR IT! I want to share my sex life with you! Only $6 per month and you get full access to over 400 videos of me getting fuck_nnboza.mp4: [Errno 2] No such file or directory: 'C:\\Users\\Yoga 14\\path\\to\\output\\ThotNetwork\\KatieCarmine_I POST A NEW VIDEO ALMOST EVERYDAY AND YOU NEVER HAVE TO PAY EXTRA FOR IT! I want to share my sex life with you! Only $6 per month and you get full access to over 400 videos of me getting fuck_nnboza.mp4'
|
||||
@@ -0,0 +1,3 @@
|
||||
[2021-06-12 08:41:51,464 - bdfr.downloader - DEBUG] - File /media/smaug/private/reddit/tumblr/nxry0l.jpg from submission nxry0l already exists, continuing
|
||||
[2021-06-12 08:41:51,469 - bdfr.downloader - DEBUG] - File /media/smaug/private/reddit/tumblr/nxrlgn.gif from submission nxrlgn already exists, continuing
|
||||
[2021-06-12 08:41:51,472 - bdfr.downloader - DEBUG] - File /media/smaug/private/reddit/tumblr/nxrq9g.png from submission nxrq9g already exists, continuing
|
||||
@@ -0,0 +1,3 @@
|
||||
[2021-06-10 20:36:48,722 - bdfr.downloader - DEBUG] - Download filter removed nwfirr with URL https://www.youtube.com/watch?v=NVSiX0Tsees
|
||||
[2021-06-12 19:56:36,848 - bdfr.downloader - DEBUG] - Download filter removed nwfgcl with URL https://www.reddit.com/r/MaliciousCompliance/comments/nwfgcl/new_guy_decided_to_play_manager_alright/
|
||||
[2021-06-12 19:56:28,587 - bdfr.downloader - DEBUG] - Download filter removed nxuxjy with URL https://www.reddit.com/r/MaliciousCompliance/comments/nxuxjy/you_want_an_omelette_with_nothing_inside_okay/
|
||||
@@ -0,0 +1,7 @@
|
||||
[2021-06-12 11:58:53,864 - bdfr.downloader - INFO] - Downloaded submission nxui9y from tumblr
|
||||
[2021-06-12 11:58:56,618 - bdfr.downloader - INFO] - Downloaded submission nxsr4r from tumblr
|
||||
[2021-06-12 11:58:59,026 - bdfr.downloader - INFO] - Downloaded submission nxviir from tumblr
|
||||
[2021-06-12 11:59:00,289 - bdfr.downloader - INFO] - Downloaded submission nxusva from tumblr
|
||||
[2021-06-12 11:59:00,735 - bdfr.downloader - INFO] - Downloaded submission nxvko7 from tumblr
|
||||
[2021-06-12 11:59:01,215 - bdfr.downloader - INFO] - Downloaded submission nxvd63 from tumblr
|
||||
[2021-06-12 11:59:13,891 - bdfr.downloader - INFO] - Downloaded submission nn9cor from tumblr
|
||||
1
scripts/tests/example_logfiles/succeed_hard_link.txt
Normal file
@@ -0,0 +1 @@
|
||||
[2021-06-11 17:33:02,118 - bdfr.downloader - INFO] - Hard link made linking /media/smaug/private/reddit/tumblr/nwnp2n.jpg to /media/smaug/private/reddit/tumblr/nwskqb.jpg in submission nwnp2n
|
||||
1
scripts/tests/example_logfiles/succeed_resource_hash.txt
Normal file
@@ -0,0 +1 @@
|
||||
[2021-06-11 17:33:02,118 - bdfr.downloader - INFO] - Resource hash aaaaaaaaaaaaaaaaaaaaaaa from submission n86jk8 downloaded elsewhere
|
||||
43
scripts/tests/test_extract_failed_ids.bats
Normal file
@@ -0,0 +1,43 @@
|
||||
setup() {
|
||||
load ./test_helper/bats-support/load
|
||||
load ./test_helper/bats-assert/load
|
||||
}
|
||||
|
||||
teardown() {
|
||||
rm -f failed.txt
|
||||
}
|
||||
|
||||
@test "fail run no logfile" {
|
||||
run ../extract_failed_ids.sh
|
||||
assert_failure
|
||||
}
|
||||
|
||||
@test "fail no downloader module" {
|
||||
run ../extract_failed_ids.sh ./example_logfiles/failed_no_downloader.txt
|
||||
assert [ "$( wc -l 'failed.txt' | awk '{ print $1 }' )" -eq "3" ];
|
||||
assert [ "$( grep -Ecv '\w{6,7}' 'failed.txt' )" -eq "0" ];
|
||||
}
|
||||
|
||||
@test "fail resource error" {
|
||||
run ../extract_failed_ids.sh ./example_logfiles/failed_resource_error.txt
|
||||
assert [ "$( wc -l 'failed.txt' | awk '{ print $1 }' )" -eq "1" ];
|
||||
assert [ "$( grep -Ecv '\w{6,7}' 'failed.txt' )" -eq "0" ];
|
||||
}
|
||||
|
||||
@test "fail site downloader error" {
|
||||
run ../extract_failed_ids.sh ./example_logfiles/failed_sitedownloader_error.txt
|
||||
assert [ "$( wc -l 'failed.txt' | awk '{ print $1 }' )" -eq "2" ];
|
||||
assert [ "$( grep -Ecv '\w{6,7}' 'failed.txt' )" -eq "0" ];
|
||||
}
|
||||
|
||||
@test "fail failed file write" {
|
||||
run ../extract_failed_ids.sh ./example_logfiles/failed_write_error.txt
|
||||
assert [ "$( wc -l 'failed.txt' | awk '{ print $1 }' )" -eq "1" ];
|
||||
assert [ "$( grep -Ecv '\w{6,7}' 'failed.txt' )" -eq "0" ];
|
||||
}
|
||||
|
||||
@test "fail disabled module" {
|
||||
run ../extract_failed_ids.sh ./example_logfiles/failed_disabled_module.txt
|
||||
assert [ "$( wc -l 'failed.txt' | awk '{ print $1 }' )" -eq "1" ];
|
||||
assert [ "$( grep -Ecv '\w{6,7}' 'failed.txt' )" -eq "0" ];
|
||||
}
|
||||
38
scripts/tests/test_extract_successful_ids.bats
Normal file
@@ -0,0 +1,38 @@
|
||||
setup() {
|
||||
load ./test_helper/bats-support/load
|
||||
load ./test_helper/bats-assert/load
|
||||
}
|
||||
|
||||
teardown() {
|
||||
rm -f successful.txt
|
||||
}
|
||||
|
||||
@test "success downloaded submission" {
|
||||
run ../extract_successful_ids.sh ./example_logfiles/succeed_downloaded_submission.txt
|
||||
assert [ "$( wc -l 'successful.txt' | awk '{ print $1 }' )" -eq "7" ];
|
||||
assert [ "$( grep -Ecv '\w{6,7}' 'successful.txt' )" -eq "0" ];
|
||||
}
|
||||
|
||||
@test "success resource hash" {
|
||||
run ../extract_successful_ids.sh ./example_logfiles/succeed_resource_hash.txt
|
||||
assert [ "$( wc -l 'successful.txt' | awk '{ print $1 }' )" -eq "1" ];
|
||||
assert [ "$( grep -Ecv '\w{6,7}' 'successful.txt' )" -eq "0" ];
|
||||
}
|
||||
|
||||
@test "success download filter" {
|
||||
run ../extract_successful_ids.sh ./example_logfiles/succeed_download_filter.txt
|
||||
assert [ "$( wc -l 'successful.txt' | awk '{ print $1 }' )" -eq "3" ];
|
||||
assert [ "$( grep -Ecv '\w{6,7}' 'successful.txt' )" -eq "0" ];
|
||||
}
|
||||
|
||||
@test "success already exists" {
|
||||
run ../extract_successful_ids.sh ./example_logfiles/succeed_already_exists.txt
|
||||
assert [ "$( wc -l 'successful.txt' | awk '{ print $1 }' )" -eq "3" ];
|
||||
assert [ "$( grep -Ecv '\w{6,7}' 'successful.txt' )" -eq "0" ];
|
||||
}
|
||||
|
||||
@test "success hard link" {
|
||||
run ../extract_successful_ids.sh ./example_logfiles/succeed_hard_link.txt
|
||||
assert [ "$( wc -l 'successful.txt' | awk '{ print $1 }' )" -eq "1" ];
|
||||
assert [ "$( grep -Ecv '\w{6,7}' 'successful.txt' )" -eq "0" ];
|
||||
}
|
||||
1
scripts/tests/test_helper/bats-assert
Submodule
Submodule scripts/tests/test_helper/bats-assert added at e0de84e9c0
1
scripts/tests/test_helper/bats-support
Submodule
Submodule scripts/tests/test_helper/bats-support added at d140a65044
@@ -4,7 +4,7 @@ description_file = README.md
|
||||
description_content_type = text/markdown
|
||||
home_page = https://github.com/aliparlakci/bulk-downloader-for-reddit
|
||||
keywords = reddit, download, archive
|
||||
version = 2.1.0
|
||||
version = 2.5.2
|
||||
author = Ali Parlakci
|
||||
author_email = parlakciali@gmail.com
|
||||
maintainer = Serene Arc
|
||||
|
||||
@@ -15,6 +15,7 @@ from bdfr.archive_entry.comment_archive_entry import CommentArchiveEntry
|
||||
'subreddit': 'Python',
|
||||
'submission': 'mgi4op',
|
||||
'submission_title': '76% Faster CPython',
|
||||
'distinguished': None,
|
||||
}),
|
||||
))
|
||||
def test_get_comment_details(test_comment_id: str, expected_dict: dict, reddit_instance: praw.Reddit):
|
||||
|
||||
@@ -26,6 +26,13 @@ def test_get_comments(test_submission_id: str, min_comments: int, reddit_instanc
|
||||
'author': 'sinjen-tos',
|
||||
'id': 'm3reby',
|
||||
'link_flair_text': 'image',
|
||||
'pinned': False,
|
||||
'spoiler': False,
|
||||
'over_18': False,
|
||||
'locked': False,
|
||||
'distinguished': None,
|
||||
'created_utc': 1615583837,
|
||||
'permalink': '/r/australia/comments/m3reby/this_little_guy_fell_out_of_a_tree_and_in_front/'
|
||||
}),
|
||||
('m3kua3', {'author': 'DELETED'}),
|
||||
))
|
||||
|
||||
2
tests/integration_tests/__init__.py
Normal file
@@ -0,0 +1,2 @@
|
||||
#!/usr/bin/env python3
|
||||
# coding=utf-8
|
||||
123
tests/integration_tests/test_archive_integration.py
Normal file
@@ -0,0 +1,123 @@
#!/usr/bin/env python3
# coding=utf-8

import re
import shutil
from pathlib import Path

import pytest
from click.testing import CliRunner

from bdfr.__main__ import cli

does_test_config_exist = Path('../test_config.cfg').exists()


def copy_test_config(run_path: Path):
    shutil.copy(Path('../test_config.cfg'), Path(run_path, '../test_config.cfg'))


def create_basic_args_for_archive_runner(test_args: list[str], run_path: Path):
    copy_test_config(run_path)
    out = [
        'archive',
        str(run_path),
        '-v',
        '--config', str(Path(run_path, '../test_config.cfg')),
        '--log', str(Path(run_path, 'test_log.txt')),
    ] + test_args
    return out


@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
    ['-l', 'gstd4hk'],
    ['-l', 'm2601g', '-f', 'yaml'],
    ['-l', 'n60t4c', '-f', 'xml'],
))
def test_cli_archive_single(test_args: list[str], tmp_path: Path):
    runner = CliRunner()
    test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
    result = runner.invoke(cli, test_args)
    assert result.exit_code == 0
    assert re.search(r'Writing entry .*? to file in .*? format', result.output)


@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
    ['--subreddit', 'Mindustry', '-L', 25],
    ['--subreddit', 'Mindustry', '-L', 25, '--format', 'xml'],
    ['--subreddit', 'Mindustry', '-L', 25, '--format', 'yaml'],
    ['--subreddit', 'Mindustry', '-L', 25, '--sort', 'new'],
    ['--subreddit', 'Mindustry', '-L', 25, '--time', 'day'],
    ['--subreddit', 'Mindustry', '-L', 25, '--time', 'day', '--sort', 'new'],
))
def test_cli_archive_subreddit(test_args: list[str], tmp_path: Path):
    runner = CliRunner()
    test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
    result = runner.invoke(cli, test_args)
    assert result.exit_code == 0
    assert re.search(r'Writing entry .*? to file in .*? format', result.output)


@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
    ['--user', 'me', '--authenticate', '--all-comments', '-L', '10'],
    ['--user', 'me', '--user', 'djnish', '--authenticate', '--all-comments', '-L', '10'],
))
def test_cli_archive_all_user_comments(test_args: list[str], tmp_path: Path):
    runner = CliRunner()
    test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
    result = runner.invoke(cli, test_args)
    assert result.exit_code == 0


@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
    ['--comment-context', '--link', 'gxqapql'],
))
def test_cli_archive_full_context(test_args: list[str], tmp_path: Path):
    runner = CliRunner()
    test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
    result = runner.invoke(cli, test_args)
    assert result.exit_code == 0
    assert 'Converting comment' in result.output


@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.slow
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
    ['--subreddit', 'all', '-L', 100],
    ['--subreddit', 'all', '-L', 100, '--sort', 'new'],
))
def test_cli_archive_long(test_args: list[str], tmp_path: Path):
    runner = CliRunner()
    test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
    result = runner.invoke(cli, test_args)
    assert result.exit_code == 0
    assert re.search(r'Writing entry .*? to file in .*? format', result.output)


@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
    ['--ignore-user', 'ArjanEgges', '-l', 'm3hxzd'],
))
def test_cli_archive_ignore_user(test_args: list[str], tmp_path: Path):
    runner = CliRunner()
    test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
    result = runner.invoke(cli, test_args)
    assert result.exit_code == 0
    assert 'being an ignored user' in result.output
    assert 'Attempting to archive submission' not in result.output
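All of these archive tests drive the real command line through Click's `CliRunner` rather than spawning a subprocess, so `result.exit_code` and `result.output` can be asserted on directly. A minimal sketch of the same pattern outside pytest, assuming a valid `test_config.cfg` is available; the output directory is illustrative and the submission ID is copied from the parametrised cases above.

```python
from pathlib import Path

from click.testing import CliRunner

from bdfr.__main__ import cli

# Illustrative output directory; the tests above use pytest's tmp_path and
# copy ../test_config.cfg into it before invoking the CLI.
out_dir = Path('/tmp/bdfr-archive-example')
args = [
    'archive', str(out_dir),
    '-v',
    '--config', str(out_dir / 'test_config.cfg'),
    '--log', str(out_dir / 'test_log.txt'),
    '-l', 'gstd4hk',  # single submission, as in test_cli_archive_single
]

result = CliRunner().invoke(cli, args)
print(result.exit_code)
print(result.output)
```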
43 tests/integration_tests/test_clone_integration.py Normal file
@@ -0,0 +1,43 @@
#!/usr/bin/env python3
# coding=utf-8

import shutil
from pathlib import Path

import pytest
from click.testing import CliRunner

from bdfr.__main__ import cli

does_test_config_exist = Path('../test_config.cfg').exists()


def copy_test_config(run_path: Path):
    shutil.copy(Path('../test_config.cfg'), Path(run_path, '../test_config.cfg'))


def create_basic_args_for_cloner_runner(test_args: list[str], tmp_path: Path):
    out = [
        'clone',
        str(tmp_path),
        '-v',
        '--config', 'test_config.cfg',
        '--log', str(Path(tmp_path, 'test_log.txt')),
    ] + test_args
    return out


@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
    ['-l', 'm2601g'],
    ['-s', 'TrollXChromosomes/', '-L', 1],
))
def test_cli_scrape_general(test_args: list[str], tmp_path: Path):
    runner = CliRunner()
    test_args = create_basic_args_for_cloner_runner(test_args, tmp_path)
    result = runner.invoke(cli, test_args)
    assert result.exit_code == 0
    assert 'Downloaded submission' in result.output
    assert 'Record for entry item' in result.output
@@ -1,7 +1,7 @@
|
||||
#!/usr/bin/env python3
|
||||
# coding=utf-8
|
||||
|
||||
import re
|
||||
import shutil
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
@@ -9,26 +9,20 @@ from click.testing import CliRunner
|
||||
|
||||
from bdfr.__main__ import cli
|
||||
|
||||
does_test_config_exist = Path('test_config.cfg').exists()
|
||||
does_test_config_exist = Path('../test_config.cfg').exists()
|
||||
|
||||
|
||||
def create_basic_args_for_download_runner(test_args: list[str], tmp_path: Path):
|
||||
def copy_test_config(run_path: Path):
|
||||
shutil.copy(Path('../test_config.cfg'), Path(run_path, '../test_config.cfg'))
|
||||
|
||||
|
||||
def create_basic_args_for_download_runner(test_args: list[str], run_path: Path):
|
||||
copy_test_config(run_path)
|
||||
out = [
|
||||
'download', str(tmp_path),
|
||||
'download', str(run_path),
|
||||
'-v',
|
||||
'--config', 'test_config.cfg',
|
||||
'--log', str(Path(tmp_path, 'test_log.txt')),
|
||||
] + test_args
|
||||
return out
|
||||
|
||||
|
||||
def create_basic_args_for_archive_runner(test_args: list[str], tmp_path: Path):
|
||||
out = [
|
||||
'archive',
|
||||
str(tmp_path),
|
||||
'-v',
|
||||
'--config', 'test_config.cfg',
|
||||
'--log', str(Path(tmp_path, 'test_log.txt')),
|
||||
'--config', str(Path(run_path, '../test_config.cfg')),
|
||||
'--log', str(Path(run_path, 'test_log.txt')),
|
||||
] + test_args
|
||||
return out
|
||||
|
||||
@@ -61,6 +55,38 @@ def test_cli_download_subreddits(test_args: list[str], tmp_path: Path):
|
||||
result = runner.invoke(cli, test_args)
|
||||
assert result.exit_code == 0
|
||||
assert 'Added submissions from subreddit ' in result.output
|
||||
assert 'Downloaded submission' in result.output
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.authenticated
|
||||
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
|
||||
@pytest.mark.parametrize('test_args', (
|
||||
['-s', 'hentai', '-L', 10, '--search', 'red', '--authenticate'],
|
||||
))
|
||||
def test_cli_download_search_subreddits_authenticated(test_args: list[str], tmp_path: Path):
|
||||
runner = CliRunner()
|
||||
test_args = create_basic_args_for_download_runner(test_args, tmp_path)
|
||||
result = runner.invoke(cli, test_args)
|
||||
assert result.exit_code == 0
|
||||
assert 'Added submissions from subreddit ' in result.output
|
||||
assert 'Downloaded submission' in result.output
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.authenticated
|
||||
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
|
||||
@pytest.mark.parametrize('test_args', (
|
||||
['--subreddit', 'friends', '-L', 10, '--authenticate'],
|
||||
))
|
||||
def test_cli_download_user_specific_subreddits(test_args: list[str], tmp_path: Path):
|
||||
runner = CliRunner()
|
||||
test_args = create_basic_args_for_download_runner(test_args, tmp_path)
|
||||
result = runner.invoke(cli, test_args)
|
||||
assert result.exit_code == 0
|
||||
assert 'Added submissions from subreddit ' in result.output
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@@ -117,6 +143,7 @@ def test_cli_download_multireddit_nonexistent(test_args: list[str], tmp_path: Pa
|
||||
@pytest.mark.authenticated
|
||||
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
|
||||
@pytest.mark.parametrize('test_args', (
|
||||
['--user', 'djnish', '--submitted', '--user', 'FriesWithThat', '-L', 10],
|
||||
['--user', 'me', '--upvoted', '--authenticate', '-L', 10],
|
||||
['--user', 'me', '--saved', '--authenticate', '-L', 10],
|
||||
['--user', 'me', '--submitted', '--authenticate', '-L', 10],
|
||||
@@ -151,7 +178,7 @@ def test_cli_download_user_data_bad_me_unauthenticated(test_args: list[str], tmp
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
|
||||
@pytest.mark.parametrize('test_args', (
|
||||
['--subreddit', 'python', '-L', 10, '--search-existing'],
|
||||
['--subreddit', 'python', '-L', 1, '--search-existing'],
|
||||
))
|
||||
def test_cli_download_search_existing(test_args: list[str], tmp_path: Path):
|
||||
Path(tmp_path, 'test.txt').touch()
|
||||
@@ -168,13 +195,14 @@ def test_cli_download_search_existing(test_args: list[str], tmp_path: Path):
|
||||
@pytest.mark.parametrize('test_args', (
|
||||
['--subreddit', 'tumblr', '-L', '25', '--skip', 'png', '--skip', 'jpg'],
|
||||
['--subreddit', 'MaliciousCompliance', '-L', '25', '--skip', 'txt'],
|
||||
['--subreddit', 'tumblr', '-L', '10', '--skip-domain', 'i.redd.it'],
|
||||
))
|
||||
def test_cli_download_download_filters(test_args: list[str], tmp_path: Path):
|
||||
runner = CliRunner()
|
||||
test_args = create_basic_args_for_download_runner(test_args, tmp_path)
|
||||
result = runner.invoke(cli, test_args)
|
||||
assert result.exit_code == 0
|
||||
assert 'Download filter removed ' in result.output
|
||||
assert any((string in result.output for string in ('Download filter removed ', 'filtered due to URL')))
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@@ -191,70 +219,6 @@ def test_cli_download_long(test_args: list[str], tmp_path: Path):
|
||||
assert result.exit_code == 0
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
|
||||
@pytest.mark.parametrize('test_args', (
|
||||
['-l', 'gstd4hk'],
|
||||
['-l', 'm2601g', '-f', 'yaml'],
|
||||
['-l', 'n60t4c', '-f', 'xml'],
|
||||
))
|
||||
def test_cli_archive_single(test_args: list[str], tmp_path: Path):
|
||||
runner = CliRunner()
|
||||
test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
|
||||
result = runner.invoke(cli, test_args)
|
||||
assert result.exit_code == 0
|
||||
assert re.search(r'Writing entry .*? to file in .*? format', result.output)
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
|
||||
@pytest.mark.parametrize('test_args', (
|
||||
['--subreddit', 'Mindustry', '-L', 25],
|
||||
['--subreddit', 'Mindustry', '-L', 25, '--format', 'xml'],
|
||||
['--subreddit', 'Mindustry', '-L', 25, '--format', 'yaml'],
|
||||
['--subreddit', 'Mindustry', '-L', 25, '--sort', 'new'],
|
||||
['--subreddit', 'Mindustry', '-L', 25, '--time', 'day'],
|
||||
['--subreddit', 'Mindustry', '-L', 25, '--time', 'day', '--sort', 'new'],
|
||||
))
|
||||
def test_cli_archive_subreddit(test_args: list[str], tmp_path: Path):
|
||||
runner = CliRunner()
|
||||
test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
|
||||
result = runner.invoke(cli, test_args)
|
||||
assert result.exit_code == 0
|
||||
assert re.search(r'Writing entry .*? to file in .*? format', result.output)
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
|
||||
@pytest.mark.parametrize('test_args', (
|
||||
['--user', 'me', '--authenticate', '--all-comments', '-L', '10'],
|
||||
))
|
||||
def test_cli_archive_all_user_comments(test_args: list[str], tmp_path: Path):
|
||||
runner = CliRunner()
|
||||
test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
|
||||
result = runner.invoke(cli, test_args)
|
||||
assert result.exit_code == 0
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.slow
|
||||
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
|
||||
@pytest.mark.parametrize('test_args', (
|
||||
['--subreddit', 'all', '-L', 100],
|
||||
['--subreddit', 'all', '-L', 100, '--sort', 'new'],
|
||||
))
|
||||
def test_cli_archive_long(test_args: list[str], tmp_path: Path):
|
||||
runner = CliRunner()
|
||||
test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
|
||||
result = runner.invoke(cli, test_args)
|
||||
assert result.exit_code == 0
|
||||
assert re.search(r'Writing entry .*? to file in .*? format', result.output)
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.slow
|
||||
@@ -265,12 +229,15 @@ def test_cli_archive_long(test_args: list[str], tmp_path: Path):
|
||||
['--user', 'sdclhgsolgjeroij', '--upvoted', '-L', 10],
|
||||
['--subreddit', 'submitters', '-L', 10], # Private subreddit
|
||||
['--subreddit', 'donaldtrump', '-L', 10], # Banned subreddit
|
||||
['--user', 'djnish', '--user', 'helen_darten', '-m', 'cuteanimalpics', '-L', 10],
|
||||
['--subreddit', 'friends', '-L', 10],
|
||||
))
|
||||
def test_cli_download_soft_fail(test_args: list[str], tmp_path: Path):
|
||||
runner = CliRunner()
|
||||
test_args = create_basic_args_for_download_runner(test_args, tmp_path)
|
||||
result = runner.invoke(cli, test_args)
|
||||
assert result.exit_code == 0
|
||||
assert 'Downloaded' not in result.output
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@@ -333,9 +300,55 @@ def test_cli_download_subreddit_exclusion(test_args: list[str], tmp_path: Path):
|
||||
['--file-scheme', '{TITLE}'],
|
||||
['--file-scheme', '{TITLE}_test_{SUBREDDIT}'],
|
||||
))
|
||||
def test_cli_file_scheme_warning(test_args: list[str], tmp_path: Path):
|
||||
def test_cli_download_file_scheme_warning(test_args: list[str], tmp_path: Path):
|
||||
runner = CliRunner()
|
||||
test_args = create_basic_args_for_download_runner(test_args, tmp_path)
|
||||
result = runner.invoke(cli, test_args)
|
||||
assert result.exit_code == 0
|
||||
assert 'Some files might not be downloaded due to name conflicts' in result.output
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
|
||||
@pytest.mark.parametrize('test_args', (
|
||||
['-l', 'm2601g', '--disable-module', 'Direct'],
|
||||
['-l', 'nnb9vs', '--disable-module', 'YoutubeDlFallback'],
|
||||
['-l', 'nnb9vs', '--disable-module', 'youtubedlfallback'],
|
||||
))
|
||||
def test_cli_download_disable_modules(test_args: list[str], tmp_path: Path):
|
||||
runner = CliRunner()
|
||||
test_args = create_basic_args_for_download_runner(test_args, tmp_path)
|
||||
result = runner.invoke(cli, test_args)
|
||||
assert result.exit_code == 0
|
||||
assert 'skipped due to disabled module' in result.output
|
||||
assert 'Downloaded submission' not in result.output
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
|
||||
def test_cli_download_include_id_file(tmp_path: Path):
|
||||
test_file = Path(tmp_path, 'include.txt')
|
||||
test_args = ['--include-id-file', str(test_file)]
|
||||
test_file.write_text('odr9wg\nody576')
|
||||
runner = CliRunner()
|
||||
test_args = create_basic_args_for_download_runner(test_args, tmp_path)
|
||||
result = runner.invoke(cli, test_args)
|
||||
assert result.exit_code == 0
|
||||
assert 'Downloaded submission' in result.output
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
|
||||
@pytest.mark.parametrize('test_args', (
|
||||
['--ignore-user', 'ArjanEgges', '-l', 'm3hxzd'],
|
||||
))
|
||||
def test_cli_download_ignore_user(test_args: list[str], tmp_path: Path):
|
||||
runner = CliRunner()
|
||||
test_args = create_basic_args_for_download_runner(test_args, tmp_path)
|
||||
result = runner.invoke(cli, test_args)
|
||||
assert result.exit_code == 0
|
||||
assert 'Downloaded submission' not in result.output
|
||||
assert 'being an ignored user' in result.output
|
||||
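`test_cli_download_include_id_file` above also documents the expected layout of an ID file: one submission ID per line, handed to the downloader with `--include-id-file`. A small sketch of preparing such a file before a run, with the IDs copied from the test and hypothetical paths:

```python
from pathlib import Path

from click.testing import CliRunner

from bdfr.__main__ import cli

# One submission ID per line, exactly as the test writes ('odr9wg\nody576').
id_file = Path('/tmp/include.txt')
id_file.write_text('odr9wg\nody576')

args = [
    'download', '/tmp/bdfr-download-example',
    '-v',
    '--include-id-file', str(id_file),
]
result = CliRunner().invoke(cli, args)
```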
@@ -4,8 +4,9 @@ from unittest.mock import MagicMock
|
||||
|
||||
import pytest
|
||||
|
||||
from bdfr.exceptions import NotADownloadableLinkError
|
||||
from bdfr.resource import Resource
|
||||
from bdfr.site_downloaders.fallback_downloaders.youtubedl_fallback import YoutubeDlFallback
|
||||
from bdfr.site_downloaders.fallback_downloaders.ytdlp_fallback import YtdlpFallback
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@@ -13,16 +14,26 @@ from bdfr.site_downloaders.fallback_downloaders.youtubedl_fallback import Youtub
|
||||
('https://www.reddit.com/r/specializedtools/comments/n2nw5m/bamboo_splitter/', True),
|
||||
('https://www.youtube.com/watch?v=P19nvJOmqCc', True),
|
||||
('https://www.example.com/test', False),
|
||||
('https://milesmatrix.bandcamp.com/album/la-boum/', False),
|
||||
))
|
||||
def test_can_handle_link(test_url: str, expected: bool):
|
||||
result = YoutubeDlFallback.can_handle_link(test_url)
|
||||
result = YtdlpFallback.can_handle_link(test_url)
|
||||
assert result == expected
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.parametrize('test_url', (
|
||||
'https://milesmatrix.bandcamp.com/album/la-boum/',
|
||||
))
|
||||
def test_info_extraction_bad(test_url: str):
|
||||
with pytest.raises(NotADownloadableLinkError):
|
||||
YtdlpFallback.get_video_attributes(test_url)
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.slow
|
||||
@pytest.mark.parametrize(('test_url', 'expected_hash'), (
|
||||
('https://streamable.com/dt46y', '1e7f4928e55de6e3ca23d85cc9246bbb'),
|
||||
('https://streamable.com/dt46y', 'b7e465adaade5f2b6d8c2b4b7d0a2878'),
|
||||
('https://streamable.com/t8sem', '49b2d1220c485455548f1edbc05d4ecf'),
|
||||
('https://www.reddit.com/r/specializedtools/comments/n2nw5m/bamboo_splitter/', '21968d3d92161ea5e0abdcaf6311b06c'),
|
||||
('https://v.redd.it/9z1dnk3xr5k61', '351a2b57e888df5ccbc508056511f38d'),
|
||||
@@ -30,8 +41,10 @@ def test_can_handle_link(test_url: str, expected: bool):
|
||||
def test_find_resources(test_url: str, expected_hash: str):
|
||||
test_submission = MagicMock()
|
||||
test_submission.url = test_url
|
||||
downloader = YoutubeDlFallback(test_submission)
|
||||
downloader = YtdlpFallback(test_submission)
|
||||
resources = downloader.find_resources()
|
||||
assert len(resources) == 1
|
||||
assert isinstance(resources[0], Resource)
|
||||
for res in resources:
|
||||
res.download()
|
||||
assert resources[0].hash.hexdigest() == expected_hash
|
||||
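The renamed `YtdlpFallback` keeps the class-level interface the old `YoutubeDlFallback` tests exercised: `can_handle_link` for cheap URL vetting, and `get_video_attributes`, which raises `NotADownloadableLinkError` when yt-dlp cannot extract anything. A short usage sketch of those calls, with URLs copied from the parametrised cases above (network access required):

```python
from bdfr.exceptions import NotADownloadableLinkError
from bdfr.site_downloaders.fallback_downloaders.ytdlp_fallback import YtdlpFallback

# True and False respectively, per the can_handle_link cases above.
print(YtdlpFallback.can_handle_link('https://www.youtube.com/watch?v=P19nvJOmqCc'))
print(YtdlpFallback.can_handle_link('https://milesmatrix.bandcamp.com/album/la-boum/'))

try:
    YtdlpFallback.get_video_attributes('https://milesmatrix.bandcamp.com/album/la-boum/')
except NotADownloadableLinkError:
    print('Not downloadable, as test_info_extraction_bad expects')
```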
@@ -21,5 +21,5 @@ def test_download_resource(test_url: str, expected_hash: str):
|
||||
resources = test_site.find_resources()
|
||||
assert len(resources) == 1
|
||||
assert isinstance(resources[0], Resource)
|
||||
resources[0].download(120)
|
||||
resources[0].download()
|
||||
assert resources[0].hash.hexdigest() == expected_hash
|
||||
|
||||
@@ -9,10 +9,11 @@ from bdfr.site_downloaders.base_downloader import BaseDownloader
|
||||
from bdfr.site_downloaders.direct import Direct
|
||||
from bdfr.site_downloaders.download_factory import DownloadFactory
|
||||
from bdfr.site_downloaders.erome import Erome
|
||||
from bdfr.site_downloaders.fallback_downloaders.youtubedl_fallback import YoutubeDlFallback
|
||||
from bdfr.site_downloaders.fallback_downloaders.ytdlp_fallback import YtdlpFallback
|
||||
from bdfr.site_downloaders.gallery import Gallery
|
||||
from bdfr.site_downloaders.gfycat import Gfycat
|
||||
from bdfr.site_downloaders.imgur import Imgur
|
||||
from bdfr.site_downloaders.pornhub import PornHub
|
||||
from bdfr.site_downloaders.redgifs import Redgifs
|
||||
from bdfr.site_downloaders.self_post import SelfPost
|
||||
from bdfr.site_downloaders.youtube import Youtube
|
||||
@@ -29,6 +30,7 @@ from bdfr.site_downloaders.youtube import Youtube
|
||||
('https://imgur.com/BuzvZwb.gifv', Imgur),
|
||||
('https://i.imgur.com/6fNdLst.gif', Direct),
|
||||
('https://imgur.com/a/MkxAzeg', Imgur),
|
||||
('https://i.imgur.com/OGeVuAe.giff', Imgur),
|
||||
('https://www.reddit.com/gallery/lu93m7', Gallery),
|
||||
('https://gfycat.com/concretecheerfulfinwhale', Gfycat),
|
||||
('https://www.erome.com/a/NWGw0F09', Erome),
|
||||
@@ -40,10 +42,12 @@ from bdfr.site_downloaders.youtube import Youtube
|
||||
('https://i.imgur.com/3SKrQfK.jpg?1', Direct),
|
||||
('https://dynasty-scans.com/system/images_images/000/017/819/original/80215103_p0.png?1612232781', Direct),
|
||||
('https://m.imgur.com/a/py3RW0j', Imgur),
|
||||
('https://v.redd.it/9z1dnk3xr5k61', YoutubeDlFallback),
|
||||
('https://streamable.com/dt46y', YoutubeDlFallback),
|
||||
('https://vimeo.com/channels/31259/53576664', YoutubeDlFallback),
|
||||
('http://video.pbs.org/viralplayer/2365173446/', YoutubeDlFallback),
|
||||
('https://v.redd.it/9z1dnk3xr5k61', YtdlpFallback),
|
||||
('https://streamable.com/dt46y', YtdlpFallback),
|
||||
('https://vimeo.com/channels/31259/53576664', YtdlpFallback),
|
||||
('http://video.pbs.org/viralplayer/2365173446/', YtdlpFallback),
|
||||
('https://www.pornhub.com/view_video.php?viewkey=ph5a2ee0461a8d0', PornHub),
|
||||
('https://www.patreon.com/posts/minecart-track-59346560', Gallery),
|
||||
))
|
||||
def test_factory_lever_good(test_submission_url: str, expected_class: BaseDownloader, reddit_instance: praw.Reddit):
|
||||
result = DownloadFactory.pull_lever(test_submission_url)
|
||||
@@ -69,6 +73,19 @@ def test_factory_lever_bad(test_url: str):
|
||||
('https://youtube.com/watch?v=Gv8Wz74FjVA', 'youtube.com/watch'),
|
||||
('https://i.imgur.com/BuzvZwb.gifv', 'i.imgur.com/BuzvZwb.gifv'),
|
||||
))
|
||||
def test_sanitise_urll(test_url: str, expected: str):
|
||||
result = DownloadFactory._sanitise_url(test_url)
|
||||
def test_sanitise_url(test_url: str, expected: str):
|
||||
result = DownloadFactory.sanitise_url(test_url)
|
||||
assert result == expected
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_url', 'expected'), (
|
||||
('www.example.com/test.asp', True),
|
||||
('www.example.com/test.html', True),
|
||||
('www.example.com/test.js', True),
|
||||
('www.example.com/test.xhtml', True),
|
||||
('www.example.com/test.mp4', False),
|
||||
('www.example.com/test.png', False),
|
||||
))
|
||||
def test_is_web_resource(test_url: str, expected: bool):
|
||||
result = DownloadFactory.is_web_resource(test_url)
|
||||
assert result == expected
|
||||
|
||||
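The factory tests above fix the mapping from a submission URL to a downloader class, plus the helpers used along the way. A usage sketch limited to the calls those tests make, with URLs and expected results copied from the parametrised cases:

```python
from bdfr.site_downloaders.download_factory import DownloadFactory

# Returns the downloader class the BDFR would use for this URL (Gfycat here,
# per test_factory_lever_good above).
downloader_class = DownloadFactory.pull_lever('https://gfycat.com/concretecheerfulfinwhale')
print(downloader_class.__name__)

# 'youtube.com/watch', per test_sanitise_url above.
print(DownloadFactory.sanitise_url('https://youtube.com/watch?v=Gv8Wz74FjVA'))
```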
@@ -1,6 +1,6 @@
|
||||
#!/usr/bin/env python3
|
||||
# coding=utf-8
|
||||
|
||||
import re
|
||||
from unittest.mock import MagicMock
|
||||
|
||||
import pytest
|
||||
@@ -11,47 +11,37 @@ from bdfr.site_downloaders.erome import Erome
|
||||
@pytest.mark.online
|
||||
@pytest.mark.parametrize(('test_url', 'expected_urls'), (
|
||||
('https://www.erome.com/a/vqtPuLXh', (
|
||||
'https://s11.erome.com/365/vqtPuLXh/KH2qBT99_480p.mp4',
|
||||
r'https://s\d+.erome.com/365/vqtPuLXh/KH2qBT99_480p.mp4',
|
||||
)),
|
||||
('https://www.erome.com/a/ORhX0FZz', (
|
||||
'https://s4.erome.com/355/ORhX0FZz/9IYQocM9_480p.mp4',
|
||||
'https://s4.erome.com/355/ORhX0FZz/9eEDc8xm_480p.mp4',
|
||||
'https://s4.erome.com/355/ORhX0FZz/EvApC7Rp_480p.mp4',
|
||||
'https://s4.erome.com/355/ORhX0FZz/LruobtMs_480p.mp4',
|
||||
'https://s4.erome.com/355/ORhX0FZz/TJNmSUU5_480p.mp4',
|
||||
'https://s4.erome.com/355/ORhX0FZz/X11Skh6Z_480p.mp4',
|
||||
'https://s4.erome.com/355/ORhX0FZz/bjlTkpn7_480p.mp4'
|
||||
r'https://s\d+.erome.com/355/ORhX0FZz/9IYQocM9_480p.mp4',
|
||||
r'https://s\d+.erome.com/355/ORhX0FZz/9eEDc8xm_480p.mp4',
|
||||
r'https://s\d+.erome.com/355/ORhX0FZz/EvApC7Rp_480p.mp4',
|
||||
r'https://s\d+.erome.com/355/ORhX0FZz/LruobtMs_480p.mp4',
|
||||
r'https://s\d+.erome.com/355/ORhX0FZz/TJNmSUU5_480p.mp4',
|
||||
r'https://s\d+.erome.com/355/ORhX0FZz/X11Skh6Z_480p.mp4',
|
||||
r'https://s\d+.erome.com/355/ORhX0FZz/bjlTkpn7_480p.mp4'
|
||||
)),
|
||||
))
|
||||
def test_get_link(test_url: str, expected_urls: tuple[str]):
|
||||
result = Erome._get_links(test_url)
|
||||
assert set(result) == set(expected_urls)
|
||||
assert all([any([re.match(p, r) for r in result]) for p in expected_urls])
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.slow
|
||||
@pytest.mark.parametrize(('test_url', 'expected_hashes'), (
|
||||
('https://www.erome.com/a/vqtPuLXh', {
|
||||
'5da2a8d60d87bed279431fdec8e7d72f'
|
||||
}),
|
||||
('https://www.erome.com/i/ItASD33e', {
|
||||
'b0d73fedc9ce6995c2f2c4fdb6f11eff'
|
||||
}),
|
||||
('https://www.erome.com/a/lGrcFxmb', {
|
||||
'0e98f9f527a911dcedde4f846bb5b69f',
|
||||
'25696ae364750a5303fc7d7dc78b35c1',
|
||||
'63775689f438bd393cde7db6d46187de',
|
||||
'a1abf398cfd4ef9cfaf093ceb10c746a',
|
||||
'bd9e1a4ea5ef0d6ba47fb90e337c2d14'
|
||||
}),
|
||||
@pytest.mark.parametrize(('test_url', 'expected_hashes_len'), (
|
||||
('https://www.erome.com/a/vqtPuLXh', 1),
|
||||
('https://www.erome.com/a/4tP3KI6F', 1),
|
||||
))
|
||||
def test_download_resource(test_url: str, expected_hashes: tuple[str]):
|
||||
def test_download_resource(test_url: str, expected_hashes_len: int):
|
||||
# Can't compare hashes for this test, Erome doesn't return the exact same file from request to request so the hash
|
||||
# will change back and forth randomly
|
||||
mock_submission = MagicMock()
|
||||
mock_submission.url = test_url
|
||||
test_site = Erome(mock_submission)
|
||||
resources = test_site.find_resources()
|
||||
[res.download(120) for res in resources]
|
||||
for res in resources:
|
||||
res.download()
|
||||
resource_hashes = [res.hash.hexdigest() for res in resources]
|
||||
assert len(resource_hashes) == len(expected_hashes)
|
||||
assert len(resource_hashes) == expected_hashes_len
|
||||
|
||||
@@ -4,34 +4,37 @@
|
||||
import praw
|
||||
import pytest
|
||||
|
||||
from bdfr.exceptions import SiteDownloaderError
|
||||
from bdfr.site_downloaders.gallery import Gallery
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.parametrize(('test_url', 'expected'), (
|
||||
('https://www.reddit.com/gallery/m6lvrh', {
|
||||
'https://preview.redd.it/18nzv9ch0hn61.jpg?width=4160&'
|
||||
'format=pjpg&auto=webp&s=470a825b9c364e0eace0036882dcff926f821de8',
|
||||
'https://preview.redd.it/jqkizcch0hn61.jpg?width=4160&'
|
||||
'format=pjpg&auto=webp&s=ae4f552a18066bb6727676b14f2451c5feecf805',
|
||||
'https://preview.redd.it/k0fnqzbh0hn61.jpg?width=4160&'
|
||||
'format=pjpg&auto=webp&s=c6a10fececdc33983487c16ad02219fd3fc6cd76',
|
||||
'https://preview.redd.it/m3gamzbh0hn61.jpg?width=4160&'
|
||||
'format=pjpg&auto=webp&s=0dd90f324711851953e24873290b7f29ec73c444'
|
||||
@pytest.mark.parametrize(('test_ids', 'expected'), (
|
||||
([
|
||||
{'media_id': '18nzv9ch0hn61'},
|
||||
{'media_id': 'jqkizcch0hn61'},
|
||||
{'media_id': 'k0fnqzbh0hn61'},
|
||||
{'media_id': 'm3gamzbh0hn61'},
|
||||
], {
|
||||
'https://i.redd.it/18nzv9ch0hn61.jpg',
|
||||
'https://i.redd.it/jqkizcch0hn61.jpg',
|
||||
'https://i.redd.it/k0fnqzbh0hn61.jpg',
|
||||
'https://i.redd.it/m3gamzbh0hn61.jpg'
|
||||
}),
|
||||
('https://www.reddit.com/gallery/ljyy27', {
|
||||
'https://preview.redd.it/04vxj25uqih61.png?width=92&'
|
||||
'format=png&auto=webp&s=6513f3a5c5128ee7680d402cab5ea4fb2bbeead4',
|
||||
'https://preview.redd.it/0fnx83kpqih61.png?width=241&'
|
||||
'format=png&auto=webp&s=655e9deb6f499c9ba1476eaff56787a697e6255a',
|
||||
'https://preview.redd.it/7zkmr1wqqih61.png?width=237&'
|
||||
'format=png&auto=webp&s=19de214e634cbcad9959f19570c616e29be0c0b0',
|
||||
'https://preview.redd.it/u37k5gxrqih61.png?width=443&'
|
||||
'format=png&auto=webp&s=e74dae31841fe4a2545ffd794d3b25b9ff0eb862'
|
||||
([
|
||||
{'media_id': '04vxj25uqih61'},
|
||||
{'media_id': '0fnx83kpqih61'},
|
||||
{'media_id': '7zkmr1wqqih61'},
|
||||
{'media_id': 'u37k5gxrqih61'},
|
||||
], {
|
||||
'https://i.redd.it/04vxj25uqih61.png',
|
||||
'https://i.redd.it/0fnx83kpqih61.png',
|
||||
'https://i.redd.it/7zkmr1wqqih61.png',
|
||||
'https://i.redd.it/u37k5gxrqih61.png'
|
||||
}),
|
||||
))
|
||||
def test_gallery_get_links(test_url: str, expected: set[str]):
|
||||
results = Gallery._get_links(test_url)
|
||||
def test_gallery_get_links(test_ids: list[dict], expected: set[str]):
|
||||
results = Gallery._get_links(test_ids)
|
||||
assert set(results) == expected
|
||||
|
||||
|
||||
@@ -39,22 +42,45 @@ def test_gallery_get_links(test_url: str, expected: set[str]):
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize(('test_submission_id', 'expected_hashes'), (
|
||||
('m6lvrh', {
|
||||
'6c8a892ae8066cbe119218bcaac731e1',
|
||||
'93ce177f8cb7994906795f4615114d13',
|
||||
'9a293adf19354f14582608cf22124574',
|
||||
'b73e2c3daee02f99404644ea02f1ae65'
|
||||
'5c42b8341dd56eebef792e86f3981c6a',
|
||||
'8f38d76da46f4057bf2773a778e725ca',
|
||||
'f5776f8f90491c8b770b8e0a6bfa49b3',
|
||||
'fa1a43c94da30026ad19a9813a0ed2c2',
|
||||
}),
|
||||
('ljyy27', {
|
||||
'1bc38bed88f9c4770e22a37122d5c941',
|
||||
'2539a92b78f3968a069df2dffe2279f9',
|
||||
'37dea50281c219b905e46edeefc1a18d',
|
||||
'ec4924cf40549728dcf53dd40bc7a73c'
|
||||
'359c203ec81d0bc00e675f1023673238',
|
||||
'79262fd46bce5bfa550d878a3b898be4',
|
||||
'808c35267f44acb523ce03bfa5687404',
|
||||
'ec8b65bdb7f1279c4b3af0ea2bbb30c3',
|
||||
}),
|
||||
('obkflw', {
|
||||
'65163f685fb28c5b776e0e77122718be',
|
||||
'2a337eb5b13c34d3ca3f51b5db7c13e9',
|
||||
}),
|
||||
('rb3ub6', { # patreon post
|
||||
'748a976c6cedf7ea85b6f90e7cb685c7',
|
||||
'839796d7745e88ced6355504e1f74508',
|
||||
'bcdb740367d0f19f97a77e614b48a42d',
|
||||
'0f230b8c4e5d103d35a773fab9814ec3',
|
||||
'e5192d6cb4f84c4f4a658355310bf0f9',
|
||||
'91cbe172cd8ccbcf049fcea4204eb979',
|
||||
})
|
||||
))
|
||||
def test_gallery_download(test_submission_id: str, expected_hashes: set[str], reddit_instance: praw.Reddit):
|
||||
test_submission = reddit_instance.submission(id=test_submission_id)
|
||||
gallery = Gallery(test_submission)
|
||||
results = gallery.find_resources()
|
||||
[res.download(120) for res in results]
|
||||
[res.download() for res in results]
|
||||
hashes = [res.hash.hexdigest() for res in results]
|
||||
assert set(hashes) == expected_hashes
|
||||
|
||||
|
||||
@pytest.mark.parametrize('test_id', (
|
||||
'n0pyzp',
|
||||
'nxyahw',
|
||||
))
|
||||
def test_gallery_download_raises_right_error(test_id: str, reddit_instance: praw.Reddit):
|
||||
test_submission = reddit_instance.submission(id=test_id)
|
||||
gallery = Gallery(test_submission)
|
||||
with pytest.raises(SiteDownloaderError):
|
||||
gallery.find_resources()
|
||||
|
||||
@@ -13,7 +13,6 @@ from bdfr.site_downloaders.gfycat import Gfycat
|
||||
@pytest.mark.parametrize(('test_url', 'expected_url'), (
|
||||
('https://gfycat.com/definitivecaninecrayfish', 'https://giant.gfycat.com/DefinitiveCanineCrayfish.mp4'),
|
||||
('https://gfycat.com/dazzlingsilkyiguana', 'https://giant.gfycat.com/DazzlingSilkyIguana.mp4'),
|
||||
('https://gfycat.com/webbedimpurebutterfly', 'https://thumbs2.redgifs.com/WebbedImpureButterfly.mp4'),
|
||||
))
|
||||
def test_get_link(test_url: str, expected_url: str):
|
||||
result = Gfycat._get_link(test_url)
|
||||
@@ -32,5 +31,5 @@ def test_download_resource(test_url: str, expected_hash: str):
|
||||
resources = test_site.find_resources()
|
||||
assert len(resources) == 1
|
||||
assert isinstance(resources[0], Resource)
|
||||
resources[0].download(120)
|
||||
resources[0].download()
|
||||
assert resources[0].hash.hexdigest() == expected_hash
|
||||
|
||||
@@ -65,11 +65,11 @@ def test_get_data_album(test_url: str, expected_gen_dict: dict, expected_image_d
|
||||
{'hash': 'dLk3FGY', 'title': '', 'ext': '.mp4', 'animated': True}
|
||||
),
|
||||
(
|
||||
'https://imgur.com/BuzvZwb.gifv',
|
||||
'https://imgur.com/65FqTpT.gifv',
|
||||
{
|
||||
'hash': 'BuzvZwb',
|
||||
'hash': '65FqTpT',
|
||||
'title': '',
|
||||
'description': 'Akron Glass Works',
|
||||
'description': '',
|
||||
'animated': True,
|
||||
'mimetype': 'video/mp4'
|
||||
},
|
||||
@@ -111,7 +111,7 @@ def test_imgur_extension_validation_bad(test_extension: str):
|
||||
),
|
||||
(
|
||||
'https://imgur.com/gallery/IjJJdlC',
|
||||
('7227d4312a9779b74302724a0cfa9081',),
|
||||
('740b006cf9ec9d6f734b6e8f5130bdab',),
|
||||
),
|
||||
(
|
||||
'https://imgur.com/a/dcc84Gt',
|
||||
@@ -122,6 +122,42 @@ def test_imgur_extension_validation_bad(test_extension: str):
|
||||
'029c475ce01b58fdf1269d8771d33913',
|
||||
),
|
||||
),
|
||||
(
|
||||
'https://imgur.com/a/eemHCCK',
|
||||
(
|
||||
'9cb757fd8f055e7ef7aa88addc9d9fa5',
|
||||
'b6cb6c918e2544e96fb7c07d828774b5',
|
||||
'fb6c913d721c0bbb96aa65d7f560d385',
|
||||
),
|
||||
),
|
||||
(
|
||||
'https://i.imgur.com/lFJai6i.gifv',
|
||||
('01a6e79a30bec0e644e5da12365d5071',),
|
||||
),
|
||||
(
|
||||
'https://i.imgur.com/ywSyILa.gifv?',
|
||||
('56d4afc32d2966017c38d98568709b45',),
|
||||
),
|
||||
(
|
||||
'https://imgur.com/ubYwpbk.GIFV',
|
||||
('d4a774aac1667783f9ed3a1bd02fac0c',),
|
||||
),
|
||||
(
|
||||
'https://i.imgur.com/j1CNCZY.gifv',
|
||||
('58e7e6d972058c18b7ecde910ca147e3',),
|
||||
),
|
||||
(
|
||||
'https://i.imgur.com/uTvtQsw.gifv',
|
||||
('46c86533aa60fc0e09f2a758513e3ac2',),
|
||||
),
|
||||
(
|
||||
'https://i.imgur.com/OGeVuAe.giff',
|
||||
('77389679084d381336f168538793f218',)
|
||||
),
|
||||
(
|
||||
'https://i.imgur.com/OGeVuAe.gift',
|
||||
('77389679084d381336f168538793f218',)
|
||||
),
|
||||
))
|
||||
def test_find_resources(test_url: str, expected_hashes: list[str]):
|
||||
mock_download = Mock()
|
||||
@@ -129,7 +165,6 @@ def test_find_resources(test_url: str, expected_hashes: list[str]):
|
||||
downloader = Imgur(mock_download)
|
||||
results = downloader.find_resources()
|
||||
assert all([isinstance(res, Resource) for res in results])
|
||||
[res.download(120) for res in results]
|
||||
[res.download() for res in results]
|
||||
hashes = set([res.hash.hexdigest() for res in results])
|
||||
assert len(results) == len(expected_hashes)
|
||||
assert hashes == set(expected_hashes)
|
||||
|
||||
39 tests/site_downloaders/test_pornhub.py Normal file
@@ -0,0 +1,39 @@
|
||||
#!/usr/bin/env python3
|
||||
# coding=utf-8
|
||||
|
||||
from unittest.mock import MagicMock
|
||||
|
||||
import pytest
|
||||
|
||||
from bdfr.exceptions import SiteDownloaderError
|
||||
from bdfr.resource import Resource
|
||||
from bdfr.site_downloaders.pornhub import PornHub
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.slow
|
||||
@pytest.mark.parametrize(('test_url', 'expected_hash'), (
|
||||
('https://www.pornhub.com/view_video.php?viewkey=ph6074c59798497', 'd9b99e4ebecf2d8d67efe5e70d2acf8a'),
|
||||
('https://www.pornhub.com/view_video.php?viewkey=ph5ede121f0d3f8', ''),
|
||||
))
|
||||
def test_find_resources_good(test_url: str, expected_hash: str):
|
||||
test_submission = MagicMock()
|
||||
test_submission.url = test_url
|
||||
downloader = PornHub(test_submission)
|
||||
resources = downloader.find_resources()
|
||||
assert len(resources) == 1
|
||||
assert isinstance(resources[0], Resource)
|
||||
resources[0].download()
|
||||
assert resources[0].hash.hexdigest() == expected_hash
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.parametrize('test_url', (
|
||||
'https://www.pornhub.com/view_video.php?viewkey=ph5ede121f0d3f8',
|
||||
))
|
||||
def test_find_resources_bad(test_url: str):
|
||||
test_submission = MagicMock()
|
||||
test_submission.url = test_url
|
||||
downloader = PornHub(test_submission)
|
||||
with pytest.raises(SiteDownloaderError):
|
||||
downloader.find_resources()
|
||||
@@ -15,10 +15,8 @@ from bdfr.site_downloaders.redgifs import Redgifs
|
||||
'https://thumbs2.redgifs.com/FrighteningVictoriousSalamander.mp4'),
|
||||
('https://redgifs.com/watch/springgreendecisivetaruca',
|
||||
'https://thumbs2.redgifs.com/SpringgreenDecisiveTaruca.mp4'),
|
||||
('https://www.gifdeliverynetwork.com/regalshoddyhorsechestnutleafminer',
|
||||
'https://thumbs2.redgifs.com/RegalShoddyHorsechestnutleafminer.mp4'),
|
||||
('https://www.gifdeliverynetwork.com/maturenexthippopotamus',
|
||||
'https://thumbs2.redgifs.com/MatureNextHippopotamus.mp4'),
|
||||
('https://www.redgifs.com/watch/palegoldenrodrawhalibut',
|
||||
'https://thumbs2.redgifs.com/PalegoldenrodRawHalibut.mp4'),
|
||||
))
|
||||
def test_get_link(test_url: str, expected: str):
|
||||
result = Redgifs._get_link(test_url)
|
||||
@@ -29,8 +27,8 @@ def test_get_link(test_url: str, expected: str):
|
||||
@pytest.mark.parametrize(('test_url', 'expected_hash'), (
|
||||
('https://redgifs.com/watch/frighteningvictorioussalamander', '4007c35d9e1f4b67091b5f12cffda00a'),
|
||||
('https://redgifs.com/watch/springgreendecisivetaruca', '8dac487ac49a1f18cc1b4dabe23f0869'),
|
||||
('https://www.gifdeliverynetwork.com/maturenexthippopotamus', '9bec0a9e4163a43781368ed5d70471df'),
|
||||
('https://www.gifdeliverynetwork.com/regalshoddyhorsechestnutleafminer', '8afb4e2c090a87140230f2352bf8beba'),
|
||||
('https://redgifs.com/watch/leafysaltydungbeetle', '076792c660b9c024c0471ef4759af8bd'),
|
||||
('https://www.redgifs.com/watch/palegoldenrodrawhalibut', '46d5aa77fe80c6407de1ecc92801c10e'),
|
||||
))
|
||||
def test_download_resource(test_url: str, expected_hash: str):
|
||||
mock_submission = Mock()
|
||||
@@ -39,5 +37,5 @@ def test_download_resource(test_url: str, expected_hash: str):
|
||||
resources = test_site.find_resources()
|
||||
assert len(resources) == 1
|
||||
assert isinstance(resources[0], Resource)
|
||||
resources[0].download(120)
|
||||
resources[0].download()
|
||||
assert resources[0].hash.hexdigest() == expected_hash
|
||||
|
||||
73 tests/site_downloaders/test_vidble.py Normal file
@@ -0,0 +1,73 @@
|
||||
#!/usr/bin/env python3
|
||||
# coding=utf-8
|
||||
from unittest.mock import Mock
|
||||
|
||||
import pytest
|
||||
|
||||
from bdfr.resource import Resource
|
||||
from bdfr.site_downloaders.vidble import Vidble
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_url', 'expected'), (
|
||||
('/RDFbznUvcN_med.jpg', '/RDFbznUvcN.jpg'),
|
||||
))
|
||||
def test_change_med_url(test_url: str, expected: str):
|
||||
result = Vidble.change_med_url(test_url)
|
||||
assert result == expected
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.parametrize(('test_url', 'expected'), (
|
||||
('https://www.vidble.com/show/UxsvAssYe5', {
|
||||
'https://www.vidble.com/UxsvAssYe5.gif',
|
||||
}),
|
||||
('https://vidble.com/show/RDFbznUvcN', {
|
||||
'https://www.vidble.com/RDFbznUvcN.jpg',
|
||||
}),
|
||||
('https://vidble.com/album/h0jTLs6B', {
|
||||
'https://www.vidble.com/XG4eAoJ5JZ.jpg',
|
||||
'https://www.vidble.com/IqF5UdH6Uq.jpg',
|
||||
'https://www.vidble.com/VWuNsnLJMD.jpg',
|
||||
'https://www.vidble.com/sMmM8O650W.jpg',
|
||||
}),
|
||||
('https://vidble.com/watch?v=0q4nWakqM6kzQWxlePD8N62Dsflev0N9', {
|
||||
'https://www.vidble.com/0q4nWakqM6kzQWxlePD8N62Dsflev0N9.mp4',
|
||||
}),
|
||||
('https://www.vidble.com/pHuwWkOcEb', {
|
||||
'https://www.vidble.com/pHuwWkOcEb.jpg',
|
||||
}),
|
||||
))
|
||||
def test_get_links(test_url: str, expected: set[str]):
|
||||
results = Vidble.get_links(test_url)
|
||||
assert results == expected
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_url', 'expected_hashes'), (
|
||||
('https://www.vidble.com/show/UxsvAssYe5', {
|
||||
'0ef2f8e0e0b45936d2fb3e6fbdf67e28',
|
||||
}),
|
||||
('https://vidble.com/show/RDFbznUvcN', {
|
||||
'c2dd30a71e32369c50eed86f86efff58',
|
||||
}),
|
||||
('https://vidble.com/album/h0jTLs6B', {
|
||||
'3b3cba02e01c91f9858a95240b942c71',
|
||||
'dd6ecf5fc9e936f9fb614eb6a0537f99',
|
||||
'b31a942cd8cdda218ed547bbc04c3a27',
|
||||
'6f77c570b451eef4222804bd52267481',
|
||||
}),
|
||||
('https://vidble.com/watch?v=0q4nWakqM6kzQWxlePD8N62Dsflev0N9', {
|
||||
'cebe9d5f24dba3b0443e5097f160ca83',
|
||||
}),
|
||||
('https://www.vidble.com/pHuwWkOcEb', {
|
||||
'585f486dd0b2f23a57bddbd5bf185bc7',
|
||||
}),
|
||||
))
|
||||
def test_find_resources(test_url: str, expected_hashes: set[str]):
|
||||
mock_download = Mock()
|
||||
mock_download.url = test_url
|
||||
downloader = Vidble(mock_download)
|
||||
results = downloader.find_resources()
|
||||
assert all([isinstance(res, Resource) for res in results])
|
||||
[res.download() for res in results]
|
||||
hashes = set([res.hash.hexdigest() for res in results])
|
||||
assert hashes == set(expected_hashes)
|
||||
@@ -5,6 +5,7 @@ from unittest.mock import MagicMock
|
||||
|
||||
import pytest
|
||||
|
||||
from bdfr.exceptions import NotADownloadableLinkError
|
||||
from bdfr.resource import Resource
|
||||
from bdfr.site_downloaders.youtube import Youtube
|
||||
|
||||
@@ -12,15 +13,29 @@ from bdfr.site_downloaders.youtube import Youtube
|
||||
@pytest.mark.online
|
||||
@pytest.mark.slow
|
||||
@pytest.mark.parametrize(('test_url', 'expected_hash'), (
|
||||
('https://www.youtube.com/watch?v=uSm2VDgRIUs', 'f70b704b4b78b9bb5cd032bfc26e4971'),
|
||||
('https://www.youtube.com/watch?v=m-tKnjFwleU', '30314930d853afff8ebc7d8c36a5b833'),
|
||||
('https://www.youtube.com/watch?v=uSm2VDgRIUs', '2d60b54582df5b95ec72bb00b580d2ff'),
|
||||
('https://www.youtube.com/watch?v=GcI7nxQj7HA', '5db0fc92a0a7fb9ac91e63505eea9cf0'),
|
||||
('https://youtu.be/TMqPOlp4tNo', 'f68c00b018162857f3df4844c45302e7'), # Age restricted
|
||||
))
|
||||
def test_find_resources(test_url: str, expected_hash: str):
|
||||
def test_find_resources_good(test_url: str, expected_hash: str):
|
||||
test_submission = MagicMock()
|
||||
test_submission.url = test_url
|
||||
downloader = Youtube(test_submission)
|
||||
resources = downloader.find_resources()
|
||||
assert len(resources) == 1
|
||||
assert isinstance(resources[0], Resource)
|
||||
resources[0].download(120)
|
||||
resources[0].download()
|
||||
assert resources[0].hash.hexdigest() == expected_hash
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.parametrize('test_url', (
|
||||
'https://www.polygon.com/disney-plus/2020/5/14/21249881/gargoyles-animated-series-disney-plus-greg-weisman'
|
||||
'-interview-oj-simpson-goliath-chronicles',
|
||||
))
|
||||
def test_find_resources_bad(test_url: str):
|
||||
test_submission = MagicMock()
|
||||
test_submission.url = test_url
|
||||
downloader = Youtube(test_submission)
|
||||
with pytest.raises(NotADownloadableLinkError):
|
||||
downloader.find_resources()
|
||||
|
||||
@@ -7,51 +7,20 @@ from unittest.mock import MagicMock
|
||||
import praw
|
||||
import pytest
|
||||
|
||||
from bdfr.archive_entry.submission_archive_entry import SubmissionArchiveEntry
|
||||
from bdfr.archiver import Archiver
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize('test_submission_id', (
|
||||
'm3reby',
|
||||
@pytest.mark.parametrize(('test_submission_id', 'test_format'), (
|
||||
('m3reby', 'xml'),
|
||||
('m3reby', 'json'),
|
||||
('m3reby', 'yaml'),
|
||||
))
|
||||
def test_write_submission_json(test_submission_id: str, tmp_path: Path, reddit_instance: praw.Reddit):
|
||||
def test_write_submission_json(test_submission_id: str, tmp_path: Path, test_format: str, reddit_instance: praw.Reddit):
|
||||
archiver_mock = MagicMock()
|
||||
test_path = Path(tmp_path, 'test.json')
|
||||
archiver_mock.args.format = test_format
|
||||
test_path = Path(tmp_path, 'test')
|
||||
test_submission = reddit_instance.submission(id=test_submission_id)
|
||||
archiver_mock.file_name_formatter.format_path.return_value = test_path
|
||||
test_entry = SubmissionArchiveEntry(test_submission)
|
||||
Archiver._write_entry_json(archiver_mock, test_entry)
|
||||
archiver_mock._write_content_to_disk.assert_called_once()
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize('test_submission_id', (
|
||||
'm3reby',
|
||||
))
|
||||
def test_write_submission_xml(test_submission_id: str, tmp_path: Path, reddit_instance: praw.Reddit):
|
||||
archiver_mock = MagicMock()
|
||||
test_path = Path(tmp_path, 'test.xml')
|
||||
test_submission = reddit_instance.submission(id=test_submission_id)
|
||||
archiver_mock.file_name_formatter.format_path.return_value = test_path
|
||||
test_entry = SubmissionArchiveEntry(test_submission)
|
||||
Archiver._write_entry_xml(archiver_mock, test_entry)
|
||||
archiver_mock._write_content_to_disk.assert_called_once()
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize('test_submission_id', (
|
||||
'm3reby',
|
||||
))
|
||||
def test_write_submission_yaml(test_submission_id: str, tmp_path: Path, reddit_instance: praw.Reddit):
|
||||
archiver_mock = MagicMock()
|
||||
archiver_mock.download_directory = tmp_path
|
||||
test_path = Path(tmp_path, 'test.yaml')
|
||||
test_submission = reddit_instance.submission(id=test_submission_id)
|
||||
archiver_mock.file_name_formatter.format_path.return_value = test_path
|
||||
test_entry = SubmissionArchiveEntry(test_submission)
|
||||
Archiver._write_entry_yaml(archiver_mock, test_entry)
|
||||
archiver_mock._write_content_to_disk.assert_called_once()
|
||||
Archiver.write_entry(archiver_mock, test_submission)
|
||||
|
||||
452 tests/test_connector.py Normal file
@@ -0,0 +1,452 @@
|
||||
#!/usr/bin/env python3
|
||||
# coding=utf-8
|
||||
from datetime import datetime, timedelta
|
||||
from pathlib import Path
|
||||
from typing import Iterator
|
||||
from unittest.mock import MagicMock
|
||||
|
||||
import praw
|
||||
import praw.models
|
||||
import pytest
|
||||
|
||||
from bdfr.configuration import Configuration
|
||||
from bdfr.connector import RedditConnector, RedditTypes
|
||||
from bdfr.download_filter import DownloadFilter
|
||||
from bdfr.exceptions import BulkDownloaderException
|
||||
from bdfr.file_name_formatter import FileNameFormatter
|
||||
from bdfr.site_authenticator import SiteAuthenticator
|
||||
|
||||
|
||||
@pytest.fixture()
|
||||
def args() -> Configuration:
|
||||
args = Configuration()
|
||||
args.time_format = 'ISO'
|
||||
return args
|
||||
|
||||
|
||||
@pytest.fixture()
|
||||
def downloader_mock(args: Configuration):
|
||||
downloader_mock = MagicMock()
|
||||
downloader_mock.args = args
|
||||
downloader_mock.sanitise_subreddit_name = RedditConnector.sanitise_subreddit_name
|
||||
downloader_mock.create_filtered_listing_generator = lambda x: RedditConnector.create_filtered_listing_generator(
|
||||
downloader_mock, x)
|
||||
downloader_mock.split_args_input = RedditConnector.split_args_input
|
||||
downloader_mock.master_hash_list = {}
|
||||
return downloader_mock
|
||||
|
||||
|
||||
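The `downloader_mock` fixture above never builds a real `RedditConnector`; the tests call the class's methods as plain functions and pass a `MagicMock` in place of `self`, so only the attributes a given method reads need to be configured. A minimal illustration of that pattern, using values from the time-filter cases further down:

```python
from unittest.mock import MagicMock

from bdfr.connector import RedditConnector

# Stand-in for a RedditConnector instance; only args.time is needed here.
mock_self = MagicMock()
mock_self.args.time = 'week'

time_filter = RedditConnector.create_time_filter(mock_self)
print(time_filter.name.lower())  # 'week', matching the parametrised expectation
```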
def assert_all_results_are_submissions(result_limit: int, results: list[Iterator]) -> list:
|
||||
results = [sub for res in results for sub in res]
|
||||
assert all([isinstance(res, praw.models.Submission) for res in results])
|
||||
assert not any([isinstance(m, MagicMock) for m in results])
|
||||
if result_limit is not None:
|
||||
assert len(results) == result_limit
|
||||
return results
|
||||
|
||||
|
||||
def assert_all_results_are_submissions_or_comments(result_limit: int, results: list[Iterator]) -> list:
|
||||
results = [sub for res in results for sub in res]
|
||||
assert all([isinstance(res, praw.models.Submission) or isinstance(res, praw.models.Comment) for res in results])
|
||||
assert not any([isinstance(m, MagicMock) for m in results])
|
||||
if result_limit is not None:
|
||||
assert len(results) == result_limit
|
||||
return results
|
||||
|
||||
|
||||
def test_determine_directories(tmp_path: Path, downloader_mock: MagicMock):
|
||||
downloader_mock.args.directory = tmp_path / 'test'
|
||||
downloader_mock.config_directories.user_config_dir = tmp_path
|
||||
RedditConnector.determine_directories(downloader_mock)
|
||||
assert Path(tmp_path / 'test').exists()
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('skip_extensions', 'skip_domains'), (
|
||||
([], []),
|
||||
(['.test'], ['test.com'],),
|
||||
))
|
||||
def test_create_download_filter(skip_extensions: list[str], skip_domains: list[str], downloader_mock: MagicMock):
|
||||
downloader_mock.args.skip = skip_extensions
|
||||
downloader_mock.args.skip_domain = skip_domains
|
||||
result = RedditConnector.create_download_filter(downloader_mock)
|
||||
|
||||
assert isinstance(result, DownloadFilter)
|
||||
assert result.excluded_domains == skip_domains
|
||||
assert result.excluded_extensions == skip_extensions
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_time', 'expected'), (
|
||||
('all', 'all'),
|
||||
('hour', 'hour'),
|
||||
('day', 'day'),
|
||||
('week', 'week'),
|
||||
('random', 'all'),
|
||||
('', 'all'),
|
||||
))
|
||||
def test_create_time_filter(test_time: str, expected: str, downloader_mock: MagicMock):
|
||||
downloader_mock.args.time = test_time
|
||||
result = RedditConnector.create_time_filter(downloader_mock)
|
||||
|
||||
assert isinstance(result, RedditTypes.TimeType)
|
||||
assert result.name.lower() == expected
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_sort', 'expected'), (
|
||||
('', 'hot'),
|
||||
('hot', 'hot'),
|
||||
('controversial', 'controversial'),
|
||||
('new', 'new'),
|
||||
))
|
||||
def test_create_sort_filter(test_sort: str, expected: str, downloader_mock: MagicMock):
|
||||
downloader_mock.args.sort = test_sort
|
||||
result = RedditConnector.create_sort_filter(downloader_mock)
|
||||
|
||||
assert isinstance(result, RedditTypes.SortType)
|
||||
assert result.name.lower() == expected
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_file_scheme', 'test_folder_scheme'), (
|
||||
('{POSTID}', '{SUBREDDIT}'),
|
||||
('{REDDITOR}_{TITLE}_{POSTID}', '{SUBREDDIT}'),
|
||||
('{POSTID}', 'test'),
|
||||
('{POSTID}', ''),
|
||||
('{POSTID}', '{SUBREDDIT}/{REDDITOR}'),
|
||||
))
|
||||
def test_create_file_name_formatter(test_file_scheme: str, test_folder_scheme: str, downloader_mock: MagicMock):
|
||||
downloader_mock.args.file_scheme = test_file_scheme
|
||||
downloader_mock.args.folder_scheme = test_folder_scheme
|
||||
result = RedditConnector.create_file_name_formatter(downloader_mock)
|
||||
|
||||
assert isinstance(result, FileNameFormatter)
|
||||
assert result.file_format_string == test_file_scheme
|
||||
assert result.directory_format_string == test_folder_scheme.split('/')
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_file_scheme', 'test_folder_scheme'), (
|
||||
('', ''),
|
||||
('', '{SUBREDDIT}'),
|
||||
('test', '{SUBREDDIT}'),
|
||||
))
|
||||
def test_create_file_name_formatter_bad(test_file_scheme: str, test_folder_scheme: str, downloader_mock: MagicMock):
|
||||
downloader_mock.args.file_scheme = test_file_scheme
|
||||
downloader_mock.args.folder_scheme = test_folder_scheme
|
||||
with pytest.raises(BulkDownloaderException):
|
||||
RedditConnector.create_file_name_formatter(downloader_mock)
|
||||
|
||||
|
||||
def test_create_authenticator(downloader_mock: MagicMock):
|
||||
result = RedditConnector.create_authenticator(downloader_mock)
|
||||
assert isinstance(result, SiteAuthenticator)
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize('test_submission_ids', (
|
||||
('lvpf4l',),
|
||||
('lvpf4l', 'lvqnsn'),
|
||||
('lvpf4l', 'lvqnsn', 'lvl9kd'),
|
||||
))
|
||||
def test_get_submissions_from_link(
|
||||
test_submission_ids: list[str],
|
||||
reddit_instance: praw.Reddit,
|
||||
downloader_mock: MagicMock):
|
||||
downloader_mock.args.link = test_submission_ids
|
||||
downloader_mock.reddit_instance = reddit_instance
|
||||
results = RedditConnector.get_submissions_from_link(downloader_mock)
|
||||
assert all([isinstance(sub, praw.models.Submission) for res in results for sub in res])
|
||||
assert len(results[0]) == len(test_submission_ids)
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize(('test_subreddits', 'limit', 'sort_type', 'time_filter', 'max_expected_len'), (
|
||||
(('Futurology',), 10, 'hot', 'all', 10),
|
||||
(('Futurology', 'Mindustry, Python'), 10, 'hot', 'all', 30),
|
||||
(('Futurology',), 20, 'hot', 'all', 20),
|
||||
(('Futurology', 'Python'), 10, 'hot', 'all', 20),
|
||||
(('Futurology',), 100, 'hot', 'all', 100),
|
||||
(('Futurology',), 0, 'hot', 'all', 0),
|
||||
(('Futurology',), 10, 'top', 'all', 10),
|
||||
(('Futurology',), 10, 'top', 'week', 10),
|
||||
(('Futurology',), 10, 'hot', 'week', 10),
|
||||
))
|
||||
def test_get_subreddit_normal(
|
||||
test_subreddits: list[str],
|
||||
limit: int,
|
||||
sort_type: str,
|
||||
time_filter: str,
|
||||
max_expected_len: int,
|
||||
downloader_mock: MagicMock,
|
||||
reddit_instance: praw.Reddit,
|
||||
):
|
||||
downloader_mock.args.limit = limit
|
||||
downloader_mock.args.sort = sort_type
|
||||
downloader_mock.time_filter = RedditConnector.create_time_filter(downloader_mock)
|
||||
downloader_mock.sort_filter = RedditConnector.create_sort_filter(downloader_mock)
|
||||
downloader_mock.determine_sort_function.return_value = RedditConnector.determine_sort_function(downloader_mock)
|
||||
downloader_mock.args.subreddit = test_subreddits
|
||||
downloader_mock.reddit_instance = reddit_instance
|
||||
results = RedditConnector.get_subreddits(downloader_mock)
|
||||
test_subreddits = downloader_mock.split_args_input(test_subreddits)
|
||||
results = [sub for res1 in results for sub in res1]
|
||||
assert all([isinstance(res1, praw.models.Submission) for res1 in results])
|
||||
assert all([res.subreddit.display_name in test_subreddits for res in results])
|
||||
assert len(results) <= max_expected_len
|
||||
assert not any([isinstance(m, MagicMock) for m in results])
|
||||
|
||||
|
||||
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_time', 'test_delta'), (
    ('hour', timedelta(hours=1)),
    ('day', timedelta(days=1)),
    ('week', timedelta(days=7)),
    ('month', timedelta(days=31)),
    ('year', timedelta(days=365)),
))
def test_get_subreddit_time_verification(
        test_time: str,
        test_delta: timedelta,
        downloader_mock: MagicMock,
        reddit_instance: praw.Reddit,
):
    downloader_mock.args.limit = 10
    downloader_mock.args.sort = 'top'
    downloader_mock.args.time = test_time
    downloader_mock.time_filter = RedditConnector.create_time_filter(downloader_mock)
    downloader_mock.sort_filter = RedditConnector.create_sort_filter(downloader_mock)
    downloader_mock.determine_sort_function.return_value = RedditConnector.determine_sort_function(downloader_mock)
    downloader_mock.args.subreddit = ['all']
    downloader_mock.reddit_instance = reddit_instance
    results = RedditConnector.get_subreddits(downloader_mock)
    results = [sub for res1 in results for sub in res1]
    assert all([isinstance(res1, praw.models.Submission) for res1 in results])
    nowtime = datetime.now()
    for r in results:
        result_time = datetime.fromtimestamp(r.created_utc)
        time_diff = nowtime - result_time
        assert time_diff < test_delta


@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_subreddits', 'search_term', 'limit', 'time_filter', 'max_expected_len'), (
    (('Python',), 'scraper', 10, 'all', 10),
    (('Python',), '', 10, 'all', 0),
    (('Python',), 'djsdsgewef', 10, 'all', 0),
    (('Python',), 'scraper', 10, 'year', 10),
))
def test_get_subreddit_search(
        test_subreddits: list[str],
        search_term: str,
        time_filter: str,
        limit: int,
        max_expected_len: int,
        downloader_mock: MagicMock,
        reddit_instance: praw.Reddit,
):
    downloader_mock._determine_sort_function.return_value = praw.models.Subreddit.hot
    downloader_mock.args.limit = limit
    downloader_mock.args.search = search_term
    downloader_mock.args.subreddit = test_subreddits
    downloader_mock.reddit_instance = reddit_instance
    downloader_mock.sort_filter = RedditTypes.SortType.HOT
    downloader_mock.args.time = time_filter
    downloader_mock.time_filter = RedditConnector.create_time_filter(downloader_mock)
    results = RedditConnector.get_subreddits(downloader_mock)
    results = [sub for res in results for sub in res]
    assert all([isinstance(res, praw.models.Submission) for res in results])
    assert all([res.subreddit.display_name in test_subreddits for res in results])
    assert len(results) <= max_expected_len
    if max_expected_len != 0:
        assert len(results) > 0
    assert not any([isinstance(m, MagicMock) for m in results])


@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_user', 'test_multireddits', 'limit'), (
    ('helen_darten', ('cuteanimalpics',), 10),
    ('korfor', ('chess',), 100),
))
# Good sources at https://www.reddit.com/r/multihub/
def test_get_multireddits_public(
        test_user: str,
        test_multireddits: list[str],
        limit: int,
        reddit_instance: praw.Reddit,
        downloader_mock: MagicMock,
):
    downloader_mock.determine_sort_function.return_value = praw.models.Subreddit.hot
    downloader_mock.sort_filter = RedditTypes.SortType.HOT
    downloader_mock.args.limit = limit
    downloader_mock.args.multireddit = test_multireddits
    downloader_mock.args.user = [test_user]
    downloader_mock.reddit_instance = reddit_instance
    downloader_mock.create_filtered_listing_generator.return_value = \
        RedditConnector.create_filtered_listing_generator(
            downloader_mock,
            reddit_instance.multireddit(test_user, test_multireddits[0]),
        )
    results = RedditConnector.get_multireddits(downloader_mock)
    results = [sub for res in results for sub in res]
    assert all([isinstance(res, praw.models.Submission) for res in results])
    assert len(results) == limit
    assert not any([isinstance(m, MagicMock) for m in results])


@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_user', 'limit'), (
    ('danigirl3694', 10),
    ('danigirl3694', 50),
    ('CapitanHam', None),
))
def test_get_user_submissions(test_user: str, limit: int, downloader_mock: MagicMock, reddit_instance: praw.Reddit):
    downloader_mock.args.limit = limit
    downloader_mock.determine_sort_function.return_value = praw.models.Subreddit.hot
    downloader_mock.sort_filter = RedditTypes.SortType.HOT
    downloader_mock.args.submitted = True
    downloader_mock.args.user = [test_user]
    downloader_mock.authenticated = False
    downloader_mock.reddit_instance = reddit_instance
    downloader_mock.create_filtered_listing_generator.return_value = \
        RedditConnector.create_filtered_listing_generator(
            downloader_mock,
            reddit_instance.redditor(test_user).submissions,
        )
    results = RedditConnector.get_user_data(downloader_mock)
    results = assert_all_results_are_submissions(limit, results)
    assert all([res.author.name == test_user for res in results])
    assert not any([isinstance(m, MagicMock) for m in results])


@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.authenticated
@pytest.mark.parametrize('test_flag', (
    'upvoted',
    'saved',
))
def test_get_user_authenticated_lists(
        test_flag: str,
        downloader_mock: MagicMock,
        authenticated_reddit_instance: praw.Reddit,
):
    downloader_mock.args.__dict__[test_flag] = True
    downloader_mock.reddit_instance = authenticated_reddit_instance
    downloader_mock.args.limit = 10
    downloader_mock._determine_sort_function.return_value = praw.models.Subreddit.hot
    downloader_mock.sort_filter = RedditTypes.SortType.HOT
    downloader_mock.args.user = [RedditConnector.resolve_user_name(downloader_mock, 'me')]
    results = RedditConnector.get_user_data(downloader_mock)
    assert_all_results_are_submissions_or_comments(10, results)


@pytest.mark.parametrize(('test_name', 'expected'), (
    ('Mindustry', 'Mindustry'),
    ('Futurology', 'Futurology'),
    ('r/Mindustry', 'Mindustry'),
    ('TrollXChromosomes', 'TrollXChromosomes'),
    ('r/TrollXChromosomes', 'TrollXChromosomes'),
    ('https://www.reddit.com/r/TrollXChromosomes/', 'TrollXChromosomes'),
    ('https://www.reddit.com/r/TrollXChromosomes', 'TrollXChromosomes'),
    ('https://www.reddit.com/r/Futurology/', 'Futurology'),
    ('https://www.reddit.com/r/Futurology', 'Futurology'),
))
def test_sanitise_subreddit_name(test_name: str, expected: str):
    result = RedditConnector.sanitise_subreddit_name(test_name)
    assert result == expected


@pytest.mark.parametrize(('test_subreddit_entries', 'expected'), (
    (['test1', 'test2', 'test3'], {'test1', 'test2', 'test3'}),
    (['test1,test2', 'test3'], {'test1', 'test2', 'test3'}),
    (['test1, test2', 'test3'], {'test1', 'test2', 'test3'}),
    (['test1; test2', 'test3'], {'test1', 'test2', 'test3'}),
    (['test1, test2', 'test1,test2,test3', 'test4'], {'test1', 'test2', 'test3', 'test4'}),
    ([''], {''}),
    (['test'], {'test'}),
))
def test_split_subreddit_entries(test_subreddit_entries: list[str], expected: set[str]):
    results = RedditConnector.split_args_input(test_subreddit_entries)
    assert results == expected


def test_read_submission_ids_from_file(downloader_mock: MagicMock, tmp_path: Path):
    test_file = tmp_path / 'test.txt'
    test_file.write_text('aaaaaa\nbbbbbb')
    results = RedditConnector.read_id_files([str(test_file)])
    assert results == {'aaaaaa', 'bbbbbb'}


@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_redditor_name', (
    'Paracortex',
    'crowdstrike',
    'HannibalGoddamnit',
))
def test_check_user_existence_good(
        test_redditor_name: str,
        reddit_instance: praw.Reddit,
        downloader_mock: MagicMock,
):
    downloader_mock.reddit_instance = reddit_instance
    RedditConnector.check_user_existence(downloader_mock, test_redditor_name)


@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_redditor_name', (
    'lhnhfkuhwreolo',
    'adlkfmnhglojh',
))
def test_check_user_existence_nonexistent(
        test_redditor_name: str,
        reddit_instance: praw.Reddit,
        downloader_mock: MagicMock,
):
    downloader_mock.reddit_instance = reddit_instance
    with pytest.raises(BulkDownloaderException, match='Could not find'):
        RedditConnector.check_user_existence(downloader_mock, test_redditor_name)


@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_redditor_name', (
    'Bree-Boo',
))
def test_check_user_existence_banned(
        test_redditor_name: str,
        reddit_instance: praw.Reddit,
        downloader_mock: MagicMock,
):
    downloader_mock.reddit_instance = reddit_instance
    with pytest.raises(BulkDownloaderException, match='is banned'):
        RedditConnector.check_user_existence(downloader_mock, test_redditor_name)


@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_subreddit_name', 'expected_message'), (
    ('donaldtrump', 'cannot be found'),
    ('submitters', 'private and cannot be scraped')
))
def test_check_subreddit_status_bad(test_subreddit_name: str, expected_message: str, reddit_instance: praw.Reddit):
    test_subreddit = reddit_instance.subreddit(test_subreddit_name)
    with pytest.raises(BulkDownloaderException, match=expected_message):
        RedditConnector.check_subreddit_status(test_subreddit)


@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_subreddit_name', (
    'Python',
    'Mindustry',
    'TrollXChromosomes',
    'all',
))
def test_check_subreddit_status_good(test_subreddit_name: str, reddit_instance: praw.Reddit):
    test_subreddit = reddit_instance.subreddit(test_subreddit_name)
    RedditConnector.check_subreddit_status(test_subreddit)
@@ -46,7 +46,7 @@ def test_filter_domain(test_url: str, expected: bool, download_filter: DownloadF
|
||||
('http://reddit.com/test.gif', False),
|
||||
))
|
||||
def test_filter_all(test_url: str, expected: bool, download_filter: DownloadFilter):
|
||||
test_resource = Resource(MagicMock(), test_url)
|
||||
test_resource = Resource(MagicMock(), test_url, lambda: None)
|
||||
result = download_filter.check_resource(test_resource)
|
||||
assert result == expected
|
||||
|
||||
@@ -59,6 +59,6 @@ def test_filter_all(test_url: str, expected: bool, download_filter: DownloadFilt
|
||||
))
|
||||
def test_filter_empty_filter(test_url: str):
|
||||
download_filter = DownloadFilter()
|
||||
test_resource = Resource(MagicMock(), test_url)
|
||||
test_resource = Resource(MagicMock(), test_url, lambda: None)
|
||||
result = download_filter.check_resource(test_resource)
|
||||
assert result is True
|
||||
|
||||
@@ -1,22 +1,18 @@
|
||||
#!/usr/bin/env python3
# coding=utf-8

import os
import re
from pathlib import Path
from typing import Iterator
from unittest.mock import MagicMock
from unittest.mock import MagicMock, patch

import praw
import praw.models
import pytest

from bdfr.__main__ import setup_logging
from bdfr.configuration import Configuration
from bdfr.download_filter import DownloadFilter
from bdfr.downloader import RedditDownloader, RedditTypes
from bdfr.exceptions import BulkDownloaderException
from bdfr.file_name_formatter import FileNameFormatter
from bdfr.site_authenticator import SiteAuthenticator
from bdfr.connector import RedditConnector
from bdfr.downloader import RedditDownloader


@pytest.fixture()
|
||||
@@ -30,314 +26,105 @@ def args() -> Configuration:
|
||||
def downloader_mock(args: Configuration):
|
||||
downloader_mock = MagicMock()
|
||||
downloader_mock.args = args
|
||||
downloader_mock._sanitise_subreddit_name = RedditDownloader._sanitise_subreddit_name
|
||||
downloader_mock._split_args_input = RedditDownloader._split_args_input
|
||||
downloader_mock._sanitise_subreddit_name = RedditConnector.sanitise_subreddit_name
|
||||
downloader_mock._split_args_input = RedditConnector.split_args_input
|
||||
downloader_mock.master_hash_list = {}
|
||||
return downloader_mock
|
||||
|
||||
|
||||
def assert_all_results_are_submissions(result_limit: int, results: list[Iterator]):
|
||||
results = [sub for res in results for sub in res]
|
||||
assert all([isinstance(res, praw.models.Submission) for res in results])
|
||||
if result_limit is not None:
|
||||
assert len(results) == result_limit
|
||||
return results
|
||||
|
||||
|
||||
def test_determine_directories(tmp_path: Path, downloader_mock: MagicMock):
|
||||
downloader_mock.args.directory = tmp_path / 'test'
|
||||
downloader_mock.config_directories.user_config_dir = tmp_path
|
||||
RedditDownloader._determine_directories(downloader_mock)
|
||||
assert Path(tmp_path / 'test').exists()
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('skip_extensions', 'skip_domains'), (
|
||||
([], []),
|
||||
(['.test'], ['test.com'],),
|
||||
@pytest.mark.parametrize(('test_ids', 'test_excluded', 'expected_len'), (
|
||||
(('aaaaaa',), (), 1),
|
||||
(('aaaaaa',), ('aaaaaa',), 0),
|
||||
((), ('aaaaaa',), 0),
|
||||
(('aaaaaa', 'bbbbbb'), ('aaaaaa',), 1),
|
||||
(('aaaaaa', 'bbbbbb', 'cccccc'), ('aaaaaa',), 2),
|
||||
))
|
||||
def test_create_download_filter(skip_extensions: list[str], skip_domains: list[str], downloader_mock: MagicMock):
|
||||
downloader_mock.args.skip = skip_extensions
|
||||
downloader_mock.args.skip_domain = skip_domains
|
||||
result = RedditDownloader._create_download_filter(downloader_mock)
|
||||
|
||||
assert isinstance(result, DownloadFilter)
|
||||
assert result.excluded_domains == skip_domains
|
||||
assert result.excluded_extensions == skip_extensions
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_time', 'expected'), (
|
||||
('all', 'all'),
|
||||
('hour', 'hour'),
|
||||
('day', 'day'),
|
||||
('week', 'week'),
|
||||
('random', 'all'),
|
||||
('', 'all'),
|
||||
))
|
||||
def test_create_time_filter(test_time: str, expected: str, downloader_mock: MagicMock):
|
||||
downloader_mock.args.time = test_time
|
||||
result = RedditDownloader._create_time_filter(downloader_mock)
|
||||
|
||||
assert isinstance(result, RedditTypes.TimeType)
|
||||
assert result.name.lower() == expected
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_sort', 'expected'), (
|
||||
('', 'hot'),
|
||||
('hot', 'hot'),
|
||||
('controversial', 'controversial'),
|
||||
('new', 'new'),
|
||||
))
|
||||
def test_create_sort_filter(test_sort: str, expected: str, downloader_mock: MagicMock):
|
||||
downloader_mock.args.sort = test_sort
|
||||
result = RedditDownloader._create_sort_filter(downloader_mock)
|
||||
|
||||
assert isinstance(result, RedditTypes.SortType)
|
||||
assert result.name.lower() == expected
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_file_scheme', 'test_folder_scheme'), (
|
||||
('{POSTID}', '{SUBREDDIT}'),
|
||||
('{REDDITOR}_{TITLE}_{POSTID}', '{SUBREDDIT}'),
|
||||
('{POSTID}', 'test'),
|
||||
('{POSTID}', ''),
|
||||
('{POSTID}', '{SUBREDDIT}/{REDDITOR}'),
|
||||
))
|
||||
def test_create_file_name_formatter(test_file_scheme: str, test_folder_scheme: str, downloader_mock: MagicMock):
|
||||
downloader_mock.args.file_scheme = test_file_scheme
|
||||
downloader_mock.args.folder_scheme = test_folder_scheme
|
||||
result = RedditDownloader._create_file_name_formatter(downloader_mock)
|
||||
|
||||
assert isinstance(result, FileNameFormatter)
|
||||
assert result.file_format_string == test_file_scheme
|
||||
assert result.directory_format_string == test_folder_scheme.split('/')
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_file_scheme', 'test_folder_scheme'), (
|
||||
('', ''),
|
||||
('', '{SUBREDDIT}'),
|
||||
('test', '{SUBREDDIT}'),
|
||||
))
|
||||
def test_create_file_name_formatter_bad(test_file_scheme: str, test_folder_scheme: str, downloader_mock: MagicMock):
|
||||
downloader_mock.args.file_scheme = test_file_scheme
|
||||
downloader_mock.args.folder_scheme = test_folder_scheme
|
||||
with pytest.raises(BulkDownloaderException):
|
||||
RedditDownloader._create_file_name_formatter(downloader_mock)
|
||||
|
||||
|
||||
def test_create_authenticator(downloader_mock: MagicMock):
|
||||
result = RedditDownloader._create_authenticator(downloader_mock)
|
||||
assert isinstance(result, SiteAuthenticator)
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize('test_submission_ids', (
|
||||
('lvpf4l',),
|
||||
('lvpf4l', 'lvqnsn'),
|
||||
('lvpf4l', 'lvqnsn', 'lvl9kd'),
|
||||
))
|
||||
def test_get_submissions_from_link(
|
||||
test_submission_ids: list[str],
|
||||
reddit_instance: praw.Reddit,
|
||||
downloader_mock: MagicMock):
|
||||
downloader_mock.args.link = test_submission_ids
|
||||
downloader_mock.reddit_instance = reddit_instance
|
||||
results = RedditDownloader._get_submissions_from_link(downloader_mock)
|
||||
assert all([isinstance(sub, praw.models.Submission) for res in results for sub in res])
|
||||
assert len(results[0]) == len(test_submission_ids)
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize(('test_subreddits', 'limit', 'sort_type', 'time_filter', 'max_expected_len'), (
|
||||
(('Futurology',), 10, 'hot', 'all', 10),
|
||||
(('Futurology', 'Mindustry, Python'), 10, 'hot', 'all', 30),
|
||||
(('Futurology',), 20, 'hot', 'all', 20),
|
||||
(('Futurology', 'Python'), 10, 'hot', 'all', 20),
|
||||
(('Futurology',), 100, 'hot', 'all', 100),
|
||||
(('Futurology',), 0, 'hot', 'all', 0),
|
||||
(('Futurology',), 10, 'top', 'all', 10),
|
||||
(('Futurology',), 10, 'top', 'week', 10),
|
||||
(('Futurology',), 10, 'hot', 'week', 10),
|
||||
))
|
||||
def test_get_subreddit_normal(
|
||||
test_subreddits: list[str],
|
||||
limit: int,
|
||||
sort_type: str,
|
||||
time_filter: str,
|
||||
max_expected_len: int,
|
||||
downloader_mock: MagicMock,
|
||||
reddit_instance: praw.Reddit,
|
||||
):
|
||||
downloader_mock._determine_sort_function.return_value = praw.models.Subreddit.hot
|
||||
downloader_mock.args.limit = limit
|
||||
downloader_mock.args.sort = sort_type
|
||||
downloader_mock.args.subreddit = test_subreddits
|
||||
downloader_mock.reddit_instance = reddit_instance
|
||||
downloader_mock.sort_filter = RedditDownloader._create_sort_filter(downloader_mock)
|
||||
results = RedditDownloader._get_subreddits(downloader_mock)
|
||||
test_subreddits = downloader_mock._split_args_input(test_subreddits)
|
||||
results = [sub for res1 in results for sub in res1]
|
||||
assert all([isinstance(res1, praw.models.Submission) for res1 in results])
|
||||
assert all([res.subreddit.display_name in test_subreddits for res in results])
|
||||
assert len(results) <= max_expected_len
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize(('test_subreddits', 'search_term', 'limit', 'time_filter', 'max_expected_len'), (
|
||||
(('Python',), 'scraper', 10, 'all', 10),
|
||||
(('Python',), '', 10, 'all', 10),
|
||||
(('Python',), 'djsdsgewef', 10, 'all', 0),
|
||||
(('Python',), 'scraper', 10, 'year', 10),
|
||||
(('Python',), 'scraper', 10, 'hour', 1),
|
||||
))
|
||||
def test_get_subreddit_search(
|
||||
test_subreddits: list[str],
|
||||
search_term: str,
|
||||
time_filter: str,
|
||||
limit: int,
|
||||
max_expected_len: int,
|
||||
downloader_mock: MagicMock,
|
||||
reddit_instance: praw.Reddit,
|
||||
):
|
||||
downloader_mock._determine_sort_function.return_value = praw.models.Subreddit.hot
|
||||
downloader_mock.args.limit = limit
|
||||
downloader_mock.args.search = search_term
|
||||
downloader_mock.args.subreddit = test_subreddits
|
||||
downloader_mock.reddit_instance = reddit_instance
|
||||
downloader_mock.sort_filter = RedditTypes.SortType.HOT
|
||||
downloader_mock.args.time = time_filter
|
||||
downloader_mock.time_filter = RedditDownloader._create_time_filter(downloader_mock)
|
||||
results = RedditDownloader._get_subreddits(downloader_mock)
|
||||
results = [sub for res in results for sub in res]
|
||||
assert all([isinstance(res, praw.models.Submission) for res in results])
|
||||
assert all([res.subreddit.display_name in test_subreddits for res in results])
|
||||
assert len(results) <= max_expected_len
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize(('test_user', 'test_multireddits', 'limit'), (
|
||||
('helen_darten', ('cuteanimalpics',), 10),
|
||||
('korfor', ('chess',), 100),
|
||||
))
|
||||
# Good sources at https://www.reddit.com/r/multihub/
|
||||
def test_get_multireddits_public(
|
||||
test_user: str,
|
||||
test_multireddits: list[str],
|
||||
limit: int,
|
||||
reddit_instance: praw.Reddit,
|
||||
@patch('bdfr.site_downloaders.download_factory.DownloadFactory.pull_lever')
|
||||
def test_excluded_ids(
|
||||
mock_function: MagicMock,
|
||||
test_ids: tuple[str],
|
||||
test_excluded: tuple[str],
|
||||
expected_len: int,
|
||||
downloader_mock: MagicMock,
|
||||
):
|
||||
downloader_mock._determine_sort_function.return_value = praw.models.Subreddit.hot
|
||||
downloader_mock.sort_filter = RedditTypes.SortType.HOT
|
||||
downloader_mock.args.limit = limit
|
||||
downloader_mock.args.multireddit = test_multireddits
|
||||
downloader_mock.args.user = test_user
|
||||
downloader_mock.reddit_instance = reddit_instance
|
||||
downloader_mock._create_filtered_listing_generator.return_value = \
|
||||
RedditDownloader._create_filtered_listing_generator(
|
||||
downloader_mock,
|
||||
reddit_instance.multireddit(test_user, test_multireddits[0]),
|
||||
)
|
||||
results = RedditDownloader._get_multireddits(downloader_mock)
|
||||
results = [sub for res in results for sub in res]
|
||||
assert all([isinstance(res, praw.models.Submission) for res in results])
|
||||
assert len(results) == limit
|
||||
downloader_mock.excluded_submission_ids = test_excluded
|
||||
mock_function.return_value = MagicMock()
|
||||
mock_function.return_value.__name__ = 'test'
|
||||
test_submissions = []
|
||||
for test_id in test_ids:
|
||||
m = MagicMock()
|
||||
m.id = test_id
|
||||
m.subreddit.display_name.return_value = 'https://www.example.com/'
|
||||
m.__class__ = praw.models.Submission
|
||||
test_submissions.append(m)
|
||||
downloader_mock.reddit_lists = [test_submissions]
|
||||
for submission in test_submissions:
|
||||
RedditDownloader._download_submission(downloader_mock, submission)
|
||||
assert mock_function.call_count == expected_len
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize(('test_user', 'limit'), (
|
||||
('danigirl3694', 10),
|
||||
('danigirl3694', 50),
|
||||
('CapitanHam', None),
|
||||
@pytest.mark.parametrize('test_submission_id', (
|
||||
'm1hqw6',
|
||||
))
|
||||
def test_get_user_submissions(test_user: str, limit: int, downloader_mock: MagicMock, reddit_instance: praw.Reddit):
|
||||
downloader_mock.args.limit = limit
|
||||
downloader_mock._determine_sort_function.return_value = praw.models.Subreddit.hot
|
||||
downloader_mock.sort_filter = RedditTypes.SortType.HOT
|
||||
downloader_mock.args.submitted = True
|
||||
downloader_mock.args.user = test_user
|
||||
downloader_mock.authenticated = False
|
||||
downloader_mock.reddit_instance = reddit_instance
|
||||
downloader_mock._create_filtered_listing_generator.return_value = \
|
||||
RedditDownloader._create_filtered_listing_generator(
|
||||
downloader_mock,
|
||||
reddit_instance.redditor(test_user).submissions,
|
||||
)
|
||||
results = RedditDownloader._get_user_data(downloader_mock)
|
||||
results = assert_all_results_are_submissions(limit, results)
|
||||
assert all([res.author.name == test_user for res in results])
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.authenticated
|
||||
@pytest.mark.parametrize('test_flag', (
|
||||
'upvoted',
|
||||
'saved',
|
||||
))
|
||||
def test_get_user_authenticated_lists(
|
||||
test_flag: str,
|
||||
downloader_mock: MagicMock,
|
||||
authenticated_reddit_instance: praw.Reddit,
|
||||
):
|
||||
downloader_mock.args.__dict__[test_flag] = True
|
||||
downloader_mock.reddit_instance = authenticated_reddit_instance
|
||||
downloader_mock.args.user = 'me'
|
||||
downloader_mock.args.limit = 10
|
||||
downloader_mock._determine_sort_function.return_value = praw.models.Subreddit.hot
|
||||
downloader_mock.sort_filter = RedditTypes.SortType.HOT
|
||||
RedditDownloader._resolve_user_name(downloader_mock)
|
||||
results = RedditDownloader._get_user_data(downloader_mock)
|
||||
assert_all_results_are_submissions(10, results)
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize(('test_submission_id', 'expected_files_len'), (
|
||||
('ljyy27', 4),
|
||||
))
|
||||
def test_download_submission(
|
||||
def test_mark_hard_link(
|
||||
test_submission_id: str,
|
||||
expected_files_len: int,
|
||||
downloader_mock: MagicMock,
|
||||
reddit_instance: praw.Reddit,
|
||||
tmp_path: Path):
|
||||
tmp_path: Path,
|
||||
reddit_instance: praw.Reddit
|
||||
):
|
||||
downloader_mock.reddit_instance = reddit_instance
|
||||
downloader_mock.download_filter.check_url.return_value = True
|
||||
downloader_mock.args.folder_scheme = ''
|
||||
downloader_mock.file_name_formatter = RedditDownloader._create_file_name_formatter(downloader_mock)
|
||||
downloader_mock.args.make_hard_links = True
|
||||
downloader_mock.download_directory = tmp_path
|
||||
downloader_mock.args.folder_scheme = ''
|
||||
downloader_mock.args.file_scheme = '{POSTID}'
|
||||
downloader_mock.file_name_formatter = RedditConnector.create_file_name_formatter(downloader_mock)
|
||||
submission = downloader_mock.reddit_instance.submission(id=test_submission_id)
|
||||
original = Path(tmp_path, f'{test_submission_id}.png')
|
||||
|
||||
RedditDownloader._download_submission(downloader_mock, submission)
|
||||
folder_contents = list(tmp_path.iterdir())
|
||||
assert len(folder_contents) == expected_files_len
|
||||
assert original.exists()
|
||||
|
||||
downloader_mock.args.file_scheme = 'test2_{POSTID}'
|
||||
downloader_mock.file_name_formatter = RedditConnector.create_file_name_formatter(downloader_mock)
|
||||
RedditDownloader._download_submission(downloader_mock, submission)
|
||||
test_file_1_stats = original.stat()
|
||||
test_file_2_inode = Path(tmp_path, f'test2_{test_submission_id}.png').stat().st_ino
|
||||
|
||||
assert test_file_1_stats.st_nlink == 2
|
||||
assert test_file_1_stats.st_ino == test_file_2_inode
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
def test_download_submission_file_exists(
|
||||
@pytest.mark.parametrize(('test_submission_id', 'test_creation_date'), (
|
||||
('ndzz50', 1621204841.0),
|
||||
))
|
||||
def test_file_creation_date(
|
||||
test_submission_id: str,
|
||||
test_creation_date: float,
|
||||
downloader_mock: MagicMock,
|
||||
reddit_instance: praw.Reddit,
|
||||
tmp_path: Path,
|
||||
capsys: pytest.CaptureFixture
|
||||
reddit_instance: praw.Reddit
|
||||
):
|
||||
setup_logging(3)
|
||||
downloader_mock.reddit_instance = reddit_instance
|
||||
downloader_mock.download_filter.check_url.return_value = True
|
||||
downloader_mock.args.folder_scheme = ''
|
||||
downloader_mock.file_name_formatter = RedditDownloader._create_file_name_formatter(downloader_mock)
|
||||
downloader_mock.download_directory = tmp_path
|
||||
submission = downloader_mock.reddit_instance.submission(id='m1hqw6')
|
||||
Path(tmp_path, 'Arneeman_Metagaming isn\'t always a bad thing_m1hqw6.png').touch()
|
||||
downloader_mock.args.folder_scheme = ''
|
||||
downloader_mock.args.file_scheme = '{POSTID}'
|
||||
downloader_mock.file_name_formatter = RedditConnector.create_file_name_formatter(downloader_mock)
|
||||
submission = downloader_mock.reddit_instance.submission(id=test_submission_id)
|
||||
|
||||
RedditDownloader._download_submission(downloader_mock, submission)
|
||||
folder_contents = list(tmp_path.iterdir())
|
||||
output = capsys.readouterr()
|
||||
assert len(folder_contents) == 1
|
||||
assert 'Arneeman_Metagaming isn\'t always a bad thing_m1hqw6.png already exists' in output.out
|
||||
|
||||
for file_path in Path(tmp_path).iterdir():
|
||||
file_stats = os.stat(file_path)
|
||||
assert file_stats.st_mtime == test_creation_date
|
||||
|
||||
|
||||
def test_search_existing_files():
|
||||
results = RedditDownloader.scan_existing_files(Path('.'))
|
||||
assert len(results.keys()) != 0
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@@ -358,7 +145,7 @@ def test_download_submission_hash_exists(
|
||||
downloader_mock.download_filter.check_url.return_value = True
|
||||
downloader_mock.args.folder_scheme = ''
|
||||
downloader_mock.args.no_dupes = True
|
||||
downloader_mock.file_name_formatter = RedditDownloader._create_file_name_formatter(downloader_mock)
|
||||
downloader_mock.file_name_formatter = RedditConnector.create_file_name_formatter(downloader_mock)
|
||||
downloader_mock.download_directory = tmp_path
|
||||
downloader_mock.master_hash_list = {test_hash: None}
|
||||
submission = downloader_mock.reddit_instance.submission(id=test_submission_id)
|
||||
@@ -369,165 +156,47 @@ def test_download_submission_hash_exists(
|
||||
assert re.search(r'Resource hash .*? downloaded elsewhere', output.out)
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_name', 'expected'), (
|
||||
('Mindustry', 'Mindustry'),
|
||||
('Futurology', 'Futurology'),
|
||||
('r/Mindustry', 'Mindustry'),
|
||||
('TrollXChromosomes', 'TrollXChromosomes'),
|
||||
('r/TrollXChromosomes', 'TrollXChromosomes'),
|
||||
('https://www.reddit.com/r/TrollXChromosomes/', 'TrollXChromosomes'),
|
||||
('https://www.reddit.com/r/TrollXChromosomes', 'TrollXChromosomes'),
|
||||
('https://www.reddit.com/r/Futurology/', 'Futurology'),
|
||||
('https://www.reddit.com/r/Futurology', 'Futurology'),
|
||||
))
|
||||
def test_sanitise_subreddit_name(test_name: str, expected: str):
|
||||
result = RedditDownloader._sanitise_subreddit_name(test_name)
|
||||
assert result == expected
|
||||
|
||||
|
||||
def test_search_existing_files():
|
||||
results = RedditDownloader.scan_existing_files(Path('.'))
|
||||
assert len(results.keys()) >= 40
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_subreddit_entries', 'expected'), (
|
||||
(['test1', 'test2', 'test3'], {'test1', 'test2', 'test3'}),
|
||||
(['test1,test2', 'test3'], {'test1', 'test2', 'test3'}),
|
||||
(['test1, test2', 'test3'], {'test1', 'test2', 'test3'}),
|
||||
(['test1; test2', 'test3'], {'test1', 'test2', 'test3'}),
|
||||
(['test1, test2', 'test1,test2,test3', 'test4'], {'test1', 'test2', 'test3', 'test4'})
|
||||
))
|
||||
def test_split_subreddit_entries(test_subreddit_entries: list[str], expected: set[str]):
|
||||
results = RedditDownloader._split_args_input(test_subreddit_entries)
|
||||
assert results == expected
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize('test_submission_id', (
|
||||
'm1hqw6',
|
||||
))
|
||||
def test_mark_hard_link(
|
||||
test_submission_id: str,
|
||||
def test_download_submission_file_exists(
|
||||
downloader_mock: MagicMock,
|
||||
reddit_instance: praw.Reddit,
|
||||
tmp_path: Path,
|
||||
reddit_instance: praw.Reddit
|
||||
capsys: pytest.CaptureFixture
|
||||
):
|
||||
setup_logging(3)
|
||||
downloader_mock.reddit_instance = reddit_instance
|
||||
downloader_mock.args.make_hard_links = True
|
||||
downloader_mock.download_directory = tmp_path
|
||||
downloader_mock.download_filter.check_url.return_value = True
|
||||
downloader_mock.args.folder_scheme = ''
|
||||
downloader_mock.args.file_scheme = '{POSTID}'
|
||||
downloader_mock.file_name_formatter = RedditDownloader._create_file_name_formatter(downloader_mock)
|
||||
downloader_mock.file_name_formatter = RedditConnector.create_file_name_formatter(downloader_mock)
|
||||
downloader_mock.download_directory = tmp_path
|
||||
submission = downloader_mock.reddit_instance.submission(id='m1hqw6')
|
||||
Path(tmp_path, 'Arneeman_Metagaming isn\'t always a bad thing_m1hqw6.png').touch()
|
||||
RedditDownloader._download_submission(downloader_mock, submission)
|
||||
folder_contents = list(tmp_path.iterdir())
|
||||
output = capsys.readouterr()
|
||||
assert len(folder_contents) == 1
|
||||
assert 'Arneeman_Metagaming isn\'t always a bad thing_m1hqw6.png'\
|
||||
' from submission m1hqw6 already exists' in output.out
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize(('test_submission_id', 'expected_files_len'), (
|
||||
('ljyy27', 4),
|
||||
))
|
||||
def test_download_submission(
|
||||
test_submission_id: str,
|
||||
expected_files_len: int,
|
||||
downloader_mock: MagicMock,
|
||||
reddit_instance: praw.Reddit,
|
||||
tmp_path: Path):
|
||||
downloader_mock.reddit_instance = reddit_instance
|
||||
downloader_mock.download_filter.check_url.return_value = True
|
||||
downloader_mock.args.folder_scheme = ''
|
||||
downloader_mock.file_name_formatter = RedditConnector.create_file_name_formatter(downloader_mock)
|
||||
downloader_mock.download_directory = tmp_path
|
||||
submission = downloader_mock.reddit_instance.submission(id=test_submission_id)
|
||||
original = Path(tmp_path, f'{test_submission_id}.png')
|
||||
|
||||
RedditDownloader._download_submission(downloader_mock, submission)
|
||||
assert original.exists()
|
||||
|
||||
downloader_mock.args.file_scheme = 'test2_{POSTID}'
|
||||
downloader_mock.file_name_formatter = RedditDownloader._create_file_name_formatter(downloader_mock)
|
||||
RedditDownloader._download_submission(downloader_mock, submission)
|
||||
test_file_1_stats = original.stat()
|
||||
test_file_2_inode = Path(tmp_path, f'test2_{test_submission_id}.png').stat().st_ino
|
||||
|
||||
assert test_file_1_stats.st_nlink == 2
|
||||
assert test_file_1_stats.st_ino == test_file_2_inode
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_ids', 'test_excluded', 'expected_len'), (
|
||||
(('aaaaaa',), (), 1),
|
||||
(('aaaaaa',), ('aaaaaa',), 0),
|
||||
((), ('aaaaaa',), 0),
|
||||
(('aaaaaa', 'bbbbbb'), ('aaaaaa',), 1),
|
||||
))
|
||||
def test_excluded_ids(test_ids: tuple[str], test_excluded: tuple[str], expected_len: int, downloader_mock: MagicMock):
|
||||
downloader_mock.excluded_submission_ids = test_excluded
|
||||
test_submissions = []
|
||||
for test_id in test_ids:
|
||||
m = MagicMock()
|
||||
m.id = test_id
|
||||
test_submissions.append(m)
|
||||
downloader_mock.reddit_lists = [test_submissions]
|
||||
RedditDownloader.download(downloader_mock)
|
||||
assert downloader_mock._download_submission.call_count == expected_len
|
||||
|
||||
|
||||
def test_read_excluded_submission_ids_from_file(downloader_mock: MagicMock, tmp_path: Path):
|
||||
test_file = tmp_path / 'test.txt'
|
||||
test_file.write_text('aaaaaa\nbbbbbb')
|
||||
downloader_mock.args.exclude_id_file = [test_file]
|
||||
results = RedditDownloader._read_excluded_ids(downloader_mock)
|
||||
assert results == {'aaaaaa', 'bbbbbb'}
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize('test_redditor_name', (
|
||||
'Paracortex',
|
||||
'crowdstrike',
|
||||
'HannibalGoddamnit',
|
||||
))
|
||||
def test_check_user_existence_good(
|
||||
test_redditor_name: str,
|
||||
reddit_instance: praw.Reddit,
|
||||
downloader_mock: MagicMock,
|
||||
):
|
||||
downloader_mock.reddit_instance = reddit_instance
|
||||
RedditDownloader._check_user_existence(downloader_mock, test_redditor_name)
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize('test_redditor_name', (
|
||||
'lhnhfkuhwreolo',
|
||||
'adlkfmnhglojh',
|
||||
))
|
||||
def test_check_user_existence_nonexistent(
|
||||
test_redditor_name: str,
|
||||
reddit_instance: praw.Reddit,
|
||||
downloader_mock: MagicMock,
|
||||
):
|
||||
downloader_mock.reddit_instance = reddit_instance
|
||||
with pytest.raises(BulkDownloaderException, match='Could not find'):
|
||||
RedditDownloader._check_user_existence(downloader_mock, test_redditor_name)
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize('test_redditor_name', (
|
||||
'Bree-Boo',
|
||||
))
|
||||
def test_check_user_existence_banned(
|
||||
test_redditor_name: str,
|
||||
reddit_instance: praw.Reddit,
|
||||
downloader_mock: MagicMock,
|
||||
):
|
||||
downloader_mock.reddit_instance = reddit_instance
|
||||
with pytest.raises(BulkDownloaderException, match='is banned'):
|
||||
RedditDownloader._check_user_existence(downloader_mock, test_redditor_name)
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize(('test_subreddit_name', 'expected_message'), (
|
||||
('donaldtrump', 'cannot be found'),
|
||||
('submitters', 'private and cannot be scraped')
|
||||
))
|
||||
def test_check_subreddit_status_bad(test_subreddit_name: str, expected_message: str, reddit_instance: praw.Reddit):
|
||||
test_subreddit = reddit_instance.subreddit(test_subreddit_name)
|
||||
with pytest.raises(BulkDownloaderException, match=expected_message):
|
||||
RedditDownloader._check_subreddit_status(test_subreddit)
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize('test_subreddit_name', (
|
||||
'Python',
|
||||
'Mindustry',
|
||||
'TrollXChromosomes',
|
||||
'all',
|
||||
))
|
||||
def test_check_subreddit_status_good(test_subreddit_name: str, reddit_instance: praw.Reddit):
|
||||
test_subreddit = reddit_instance.subreddit(test_subreddit_name)
|
||||
RedditDownloader._check_subreddit_status(test_subreddit)
|
||||
folder_contents = list(tmp_path.iterdir())
|
||||
assert len(folder_contents) == expected_files_len
|
||||
|
||||
@@ -1,17 +1,21 @@
#!/usr/bin/env python3
# coding=utf-8

import platform
import sys
import unittest.mock
from datetime import datetime
from pathlib import Path
from typing import Optional
from unittest.mock import MagicMock
import platform

import praw.models
import pytest

from bdfr.file_name_formatter import FileNameFormatter
from bdfr.resource import Resource
from bdfr.site_downloaders.base_downloader import BaseDownloader
from bdfr.site_downloaders.fallback_downloaders.ytdlp_fallback import YtdlpFallback


@pytest.fixture()
@@ -28,10 +32,10 @@ def submission() -> MagicMock:
|
||||
return test
|
||||
|
||||
|
||||
def do_test_string_equality(result: str, expected: str) -> bool:
|
||||
def do_test_string_equality(result: [Path, str], expected: str) -> bool:
|
||||
if platform.system() == 'Windows':
|
||||
expected = FileNameFormatter._format_for_windows(expected)
|
||||
return expected == result
|
||||
return str(result).endswith(expected)
|
||||
|
||||
|
||||
def do_test_path_equality(result: Path, expected: str) -> bool:
|
||||
@@ -41,7 +45,7 @@ def do_test_path_equality(result: Path, expected: str) -> bool:
|
||||
expected = Path(*expected)
|
||||
else:
|
||||
expected = Path(expected)
|
||||
return result == expected
|
||||
return str(result).endswith(str(expected))
|
||||
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
@@ -118,7 +122,7 @@ def test_format_full(
|
||||
format_string_file: str,
|
||||
expected: str,
|
||||
reddit_submission: praw.models.Submission):
|
||||
test_resource = Resource(reddit_submission, 'i.reddit.com/blabla.png')
|
||||
test_resource = Resource(reddit_submission, 'i.reddit.com/blabla.png', lambda: None)
|
||||
test_formatter = FileNameFormatter(format_string_file, format_string_directory, 'ISO')
|
||||
result = test_formatter.format_path(test_resource, Path('test'))
|
||||
assert do_test_path_equality(result, expected)
|
||||
@@ -135,7 +139,7 @@ def test_format_full_conform(
|
||||
format_string_directory: str,
|
||||
format_string_file: str,
|
||||
reddit_submission: praw.models.Submission):
|
||||
test_resource = Resource(reddit_submission, 'i.reddit.com/blabla.png')
|
||||
test_resource = Resource(reddit_submission, 'i.reddit.com/blabla.png', lambda: None)
|
||||
test_formatter = FileNameFormatter(format_string_file, format_string_directory, 'ISO')
|
||||
test_formatter.format_path(test_resource, Path('test'))
|
||||
|
||||
@@ -155,7 +159,7 @@ def test_format_full_with_index_suffix(
|
||||
expected: str,
|
||||
reddit_submission: praw.models.Submission,
|
||||
):
|
||||
test_resource = Resource(reddit_submission, 'i.reddit.com/blabla.png')
|
||||
test_resource = Resource(reddit_submission, 'i.reddit.com/blabla.png', lambda: None)
|
||||
test_formatter = FileNameFormatter(format_string_file, format_string_directory, 'ISO')
|
||||
result = test_formatter.format_path(test_resource, Path('test'), index)
|
||||
assert do_test_path_equality(result, expected)
|
||||
@@ -172,8 +176,9 @@ def test_format_multiple_resources():
|
||||
mocks.append(new_mock)
|
||||
test_formatter = FileNameFormatter('{TITLE}', '', 'ISO')
|
||||
results = test_formatter.format_resource_paths(mocks, Path('.'))
|
||||
results = set([str(res[0]) for res in results])
|
||||
assert results == {'test_1.png', 'test_2.png', 'test_3.png', 'test_4.png'}
|
||||
results = set([str(res[0].name) for res in results])
|
||||
expected = {'test_1.png', 'test_2.png', 'test_3.png', 'test_4.png'}
|
||||
assert results == expected
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_filename', 'test_ending'), (
|
||||
@@ -183,10 +188,11 @@ def test_format_multiple_resources():
|
||||
('😍💕✨' * 100, '_1.png'),
|
||||
))
|
||||
def test_limit_filename_length(test_filename: str, test_ending: str):
|
||||
result = FileNameFormatter._limit_file_name_length(test_filename, test_ending)
|
||||
assert len(result) <= 255
|
||||
assert len(result.encode('utf-8')) <= 255
|
||||
assert isinstance(result, str)
|
||||
result = FileNameFormatter.limit_file_name_length(test_filename, test_ending, Path('.'))
|
||||
assert len(result.name) <= 255
|
||||
assert len(result.name.encode('utf-8')) <= 255
|
||||
assert len(str(result)) <= FileNameFormatter.find_max_path_length()
|
||||
assert isinstance(result, Path)
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_filename', 'test_ending', 'expected_end'), (
|
||||
@@ -201,25 +207,41 @@ def test_limit_filename_length(test_filename: str, test_ending: str):
|
||||
('😍💕✨' * 100 + '_aaa1aa', '_1.png', '_aaa1aa_1.png'),
|
||||
))
|
||||
def test_preserve_id_append_when_shortening(test_filename: str, test_ending: str, expected_end: str):
|
||||
result = FileNameFormatter._limit_file_name_length(test_filename, test_ending)
|
||||
assert len(result) <= 255
|
||||
assert len(result.encode('utf-8')) <= 255
|
||||
assert isinstance(result, str)
|
||||
assert result.endswith(expected_end)
|
||||
result = FileNameFormatter.limit_file_name_length(test_filename, test_ending, Path('.'))
|
||||
assert len(result.name) <= 255
|
||||
assert len(result.name.encode('utf-8')) <= 255
|
||||
assert result.name.endswith(expected_end)
|
||||
assert len(str(result)) <= FileNameFormatter.find_max_path_length()
|
||||
|
||||
|
||||
def test_shorten_filenames(submission: MagicMock, tmp_path: Path):
|
||||
submission.title = 'A' * 300
|
||||
@pytest.mark.skipif(sys.platform == 'win32', reason='Test broken on windows github')
|
||||
def test_shorten_filename_real(submission: MagicMock, tmp_path: Path):
|
||||
submission.title = 'A' * 500
|
||||
submission.author.name = 'test'
|
||||
submission.subreddit.display_name = 'test'
|
||||
submission.id = 'BBBBBB'
|
||||
test_resource = Resource(submission, 'www.example.com/empty', '.jpeg')
|
||||
test_resource = Resource(submission, 'www.example.com/empty', lambda: None, '.jpeg')
|
||||
test_formatter = FileNameFormatter('{REDDITOR}_{TITLE}_{POSTID}', '{SUBREDDIT}', 'ISO')
|
||||
result = test_formatter.format_path(test_resource, tmp_path)
|
||||
result.parent.mkdir(parents=True)
|
||||
result.touch()
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_name', 'test_ending'), (
|
||||
('a', 'b'),
|
||||
('a', '_bbbbbb.jpg'),
|
||||
('a' * 20, '_bbbbbb.jpg'),
|
||||
('a' * 50, '_bbbbbb.jpg'),
|
||||
('a' * 500, '_bbbbbb.jpg'),
|
||||
))
|
||||
def test_shorten_path(test_name: str, test_ending: str, tmp_path: Path):
|
||||
result = FileNameFormatter.limit_file_name_length(test_name, test_ending, tmp_path)
|
||||
assert len(str(result.name)) <= 255
|
||||
assert len(str(result.name).encode('UTF-8')) <= 255
|
||||
assert len(str(result.name).encode('cp1252')) <= 255
|
||||
assert len(str(result)) <= FileNameFormatter.find_max_path_length()
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_string', 'expected'), (
|
||||
('test', 'test'),
|
||||
('test😍', 'test'),
|
||||
@@ -293,9 +315,9 @@ def test_format_archive_entry_comment(
|
||||
):
|
||||
test_comment = reddit_instance.comment(id=test_comment_id)
|
||||
test_formatter = FileNameFormatter(test_file_scheme, test_folder_scheme, 'ISO')
|
||||
test_entry = Resource(test_comment, '', '.json')
|
||||
test_entry = Resource(test_comment, '', lambda: None, '.json')
|
||||
result = test_formatter.format_path(test_entry, tmp_path)
|
||||
assert do_test_string_equality(result.name, expected_name)
|
||||
assert do_test_string_equality(result, expected_name)
|
||||
|
||||
|
||||
@pytest.mark.parametrize(('test_folder_scheme', 'expected'), (
|
||||
@@ -364,3 +386,36 @@ def test_time_string_formats(test_time_format: str, expected: str):
|
||||
test_formatter = FileNameFormatter('{TITLE}', '', test_time_format)
|
||||
result = test_formatter._convert_timestamp(test_time.timestamp())
|
||||
assert result == expected
|
||||
|
||||
|
||||
def test_get_max_path_length():
|
||||
result = FileNameFormatter.find_max_path_length()
|
||||
assert result in (4096, 260, 1024)
|
||||
|
||||
|
||||
def test_windows_max_path(tmp_path: Path):
|
||||
with unittest.mock.patch('platform.system', return_value='Windows'):
|
||||
with unittest.mock.patch('bdfr.file_name_formatter.FileNameFormatter.find_max_path_length', return_value=260):
|
||||
result = FileNameFormatter.limit_file_name_length('test' * 100, '_1.png', tmp_path)
|
||||
assert len(str(result)) <= 260
|
||||
assert len(result.name) <= (260 - len(str(tmp_path)))
|
||||
|
||||
|
||||
@pytest.mark.online
|
||||
@pytest.mark.reddit
|
||||
@pytest.mark.parametrize(('test_reddit_id', 'test_downloader', 'expected_names'), (
|
||||
('gphmnr', YtdlpFallback, {'He has a lot to say today.mp4'}),
|
||||
('d0oir2', YtdlpFallback, {"Crunk's finest moment. Welcome to the new subreddit!.mp4"}),
|
||||
))
|
||||
def test_name_submission(
|
||||
test_reddit_id: str,
|
||||
test_downloader: type(BaseDownloader),
|
||||
expected_names: set[str],
|
||||
reddit_instance: praw.reddit.Reddit,
|
||||
):
|
||||
test_submission = reddit_instance.submission(id=test_reddit_id)
|
||||
test_resources = test_downloader(test_submission).find_resources()
|
||||
test_formatter = FileNameFormatter('{TITLE}', '', '')
|
||||
results = test_formatter.format_resource_paths(test_resources, Path('.'))
|
||||
results = set([r[0].name for r in results])
|
||||
assert expected_names == results
|
||||
|
||||
@@ -21,7 +21,7 @@ from bdfr.resource import Resource
    ('https://www.test.com/test/test2/example.png?random=test#thing', '.png'),
))
def test_resource_get_extension(test_url: str, expected: str):
    test_resource = Resource(MagicMock(), test_url)
    test_resource = Resource(MagicMock(), test_url, lambda: None)
    result = test_resource._determine_extension()
    assert result == expected

@@ -31,6 +31,6 @@ def test_resource_get_extension(test_url: str, expected: str):
    ('https://www.iana.org/_img/2013.1/iana-logo-header.svg', '426b3ac01d3584c820f3b7f5985d6623'),
))
def test_download_online_resource(test_url: str, expected_hash: str):
    test_resource = Resource(MagicMock(), test_url)
    test_resource.download(120)
    test_resource = Resource(MagicMock(), test_url, Resource.retry_download(test_url))
    test_resource.download()
    assert test_resource.hash.hexdigest() == expected_hash