Merge branch 'aliparlakci:master' into master

BlipRanger
2022-04-25 11:33:32 -04:00
committed by GitHub
76 changed files with 2455 additions and 1275 deletions

.gitattributes vendored Normal file

@@ -0,0 +1,2 @@
# Declare files that will always have CRLF line endings on checkout.
*.ps1 text eol=crlf
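
A quick way to verify that the attribute takes effect is `git check-attr`; the path below is only an example:

```bash
git check-attr text eol -- scripts/some_script.ps1
# scripts/some_script.ps1: text: set
# scripts/some_script.ps1: eol: crlf
```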


@@ -9,7 +9,7 @@ assignees: ''
- [ ] I am reporting a bug.
- [ ] I am running the latest version of BDFR
-- [ ] I have read the [Opening an issue](../../README.md#configuration)
+- [ ] I have read the [Opening an issue](https://github.com/aliparlakci/bulk-downloader-for-reddit/blob/master/docs/CONTRIBUTING.md#opening-an-issue)
## Description
A clear and concise description of what the bug is.


@@ -27,3 +27,9 @@ jobs:
run: |
python setup.py sdist bdist_wheel
twine upload dist/*
- name: Upload coverage report
uses: actions/upload-artifact@v2
with:
name: dist
path: dist/

.gitignore vendored

@@ -139,3 +139,6 @@ cython_debug/
# Test configuration file
test_config.cfg
.vscode/
.idea/

.gitmodules vendored Normal file

@@ -0,0 +1,9 @@
[submodule "scripts/tests/bats"]
path = scripts/tests/bats
url = https://github.com/bats-core/bats-core.git
[submodule "scripts/tests/test_helper/bats-assert"]
path = scripts/tests/test_helper/bats-assert
url = https://github.com/bats-core/bats-assert.git
[submodule "scripts/tests/test_helper/bats-support"]
path = scripts/tests/test_helper/bats-support
url = https://github.com/bats-core/bats-support.git
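
Note that submodules are not fetched by a plain `git clone`; anyone running the new test suite needs to initialise them first, for example:

```bash
git clone --recurse-submodules https://github.com/aliparlakci/bulk-downloader-for-reddit.git
# or, in an existing checkout:
git submodule update --init --recursive
```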


@@ -1,11 +1,14 @@
# Bulk Downloader for Reddit
-[![Python Test](https://github.com/aliparlakci/bulk-downloader-for-reddit/actions/workflows/test.yml/badge.svg?branch=v2)](https://github.com/aliparlakci/bulk-downloader-for-reddit/actions/workflows/test.yml)
-[![PyPI version](https://badge.fury.io/py/bdfr.svg)](https://badge.fury.io/py/bdfr)
+[![PyPI version](https://img.shields.io/pypi/v/bdfr.svg)](https://pypi.python.org/pypi/bdfr)
+[![PyPI downloads](https://img.shields.io/pypi/dm/bdfr)](https://pypi.python.org/pypi/bdfr)
+[![Python Test](https://github.com/aliparlakci/bulk-downloader-for-reddit/actions/workflows/test.yml/badge.svg?branch=master)](https://github.com/aliparlakci/bulk-downloader-for-reddit/actions/workflows/test.yml)
This is a tool to download submissions or submission data from Reddit. It can be used to archive data or even crawl Reddit to gather research data. The BDFR is flexible and can be used in scripts if needed through an extensive command-line interface. [List of currently supported sources](#list-of-currently-supported-sources)
If you wish to open an issue, please read [the guide on opening issues](docs/CONTRIBUTING.md#opening-an-issue) to ensure that your issue is clear and contains everything it needs to for the developers to investigate.
Included in this README are a few example Bash tricks to get certain behaviour. For that, see [Common Command Tricks](#common-command-tricks).
## Installation
*Bulk Downloader for Reddit* needs Python version 3.9 or above. Please update Python before installation to meet the requirement. Then, you can install it as such:
```bash
@@ -26,16 +29,24 @@ If you want to use the source code or make contributions, refer to [CONTRIBUTING
The BDFR works by taking submissions from a variety of "sources" from Reddit and then parsing them to download. These sources might be a subreddit, multireddit, a user list, or individual links. These sources are combined and downloaded to disk, according to a naming and organisational scheme defined by the user.
-There are two modes to the BDFR: download, and archive. Each one has a command that performs similar but distinct functions. The `download` command will download the resource linked in the Reddit submission, such as the images, video, etc. The `archive` command will download the submission data itself and store it, such as the submission details, upvotes, text, statistics, as well as all the comments on that submission. These can then be saved in a data markup language form, such as JSON, XML, or YAML.
+There are three modes to the BDFR: download, archive, and clone. Each one has a command that performs similar but distinct functions. The `download` command will download the resource linked in the Reddit submission, such as the images, video, etc. The `archive` command will download the submission data itself and store it, such as the submission details, upvotes, text, statistics, as well as all the comments on that submission. These can then be saved in a data markup language form, such as JSON, XML, or YAML. Lastly, the `clone` command will perform both functions of the previous commands at once and is more efficient than running those commands sequentially.
+Note that the `clone` command is not a true, faithful clone of Reddit. It simply retrieves much of the raw data that Reddit provides. To get a true clone of Reddit, another tool such as HTTrack should be used.
After installation, run the program from any directory as shown below:
```bash
python3 -m bdfr download
```
```bash
python3 -m bdfr archive
```
```bash
python3 -m bdfr clone
```
However, these commands are not enough. You should chain parameters in [Options](#options) according to your use case. Don't forget that some parameters can be provided multiple times. Some quick reference commands are:
```bash
@@ -63,6 +74,17 @@ The following options are common between both the `archive` and `download` comma
- `--config`
- If the path to a configuration file is supplied with this option, the BDFR will use the specified config
- See [Configuration Files](#configuration) for more details
- `--disable-module`
- Can be specified multiple times
- Disables certain modules from being used
- See [Disabling Modules](#disabling-modules) for more information and a list of module names
- `--ignore-user`
- This will add a user to ignore
- Can be specified multiple times
- `--include-id-file`
- This will add any submission with the IDs in the files provided
- Can be specified multiple times
- Format is one ID per line
- `--log`
- This allows one to specify the location of the logfile
- This must be done when running multiple instances of the BDFR, see [Multiple Instances](#multiple-instances) below
@@ -123,6 +145,8 @@ The following options are common between both the `archive` and `download` comma
- `-u, --user`
- This specifies the user to scrape in concert with other options
- When using `--authenticate`, `--user me` can be used to refer to the authenticated user
- Can be specified multiple times for multiple users
- If downloading a multireddit, only one user can be specified
- `-v, --verbose`
- Increases the verbosity of the program
- Can be specified multiple times
@@ -131,13 +155,6 @@ The following options are common between both the `archive` and `download` comma
The following options apply only to the `download` command. This command downloads the files and resources linked to in the submission, or a text submission itself, to the disk in the specified directory.
- `--exclude-id`
- This will skip the download of any submission with the ID provided
- Can be specified multiple times
- `--exclude-id-file`
- This will skip the download of any submission with any of the IDs in the files provided
- Can be specified multiple times
- Format is one ID per line
- `--make-hard-links`
- This flag will create hard links to an existing file when a duplicate is downloaded
- This will make the file appear in multiple directories while only taking the space of a single instance
@@ -158,6 +175,13 @@ The following options apply only to the `download` command. This command downloa
- Sets the scheme for folders
- Default is `{SUBREDDIT}`
- See [Folder and File Name Schemes](#folder-and-file-name-schemes) for more details
- `--exclude-id`
- This will skip the download of any submission with the ID provided
- Can be specified multiple times
- `--exclude-id-file`
- This will skip the download of any submission with any of the IDs in the files provided
- Can be specified multiple times
- Format is one ID per line
- `--skip-domain`
- This adds domains to the download filter i.e. submissions coming from these domains will not be downloaded
- Can be specified multiple times
@@ -181,6 +205,23 @@ The following options are for the `archive` command specifically.
- `json` (default)
- `xml`
- `yaml`
- `--comment-context`
- This option will, instead of downloading an individual comment, download the submission that comment is a part of
- May result in a longer run time as it retrieves much more data
### Cloner Options
The `clone` command can take all the options listed above for both the `archive` and `download` commands since it performs the functions of both.
## Common Command Tricks
A common use case is for subreddits/users to be loaded from a file. The BDFR doesn't support this directly but it is simple enough to do through the command-line. Consider a list of usernames to download; they can be passed through to the BDFR with the following command, assuming that the usernames are in a text file:
```bash
cat users.txt | xargs -L 1 echo --user | xargs -L 50 python3 -m bdfr download <ARGS>
```
The part `-L 50` makes sure that the character limit for a single command isn't exceeded, but it may not be necessary. This can also be used to load subreddits from a file; simply exchange `--user` with `--subreddit` and so on, as in the example below.
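For example, the same pattern with subreddits, assuming a `subreddits.txt` file with one subreddit name per line:
```bash
cat subreddits.txt | xargs -L 1 echo --subreddit | xargs -L 50 python3 -m bdfr download <ARGS>
```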
## Authentication and Security
@@ -252,6 +293,7 @@ The following keys are optional, and defaults will be used if they cannot be fou
- `backup_log_count`
- `max_wait_time`
- `time_format`
- `disabled_modules`
None of these need to be modified unless you know what you're doing, as the default values will enable the BDFR to function just fine. A configuration file is included in the BDFR when it is installed, and this will be placed in the configuration directory as the default.
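As a sketch, the optional keys above with the fallback values used elsewhere in this commit's code would look like this in the configuration file:
```ini
[DEFAULT]
backup_log_count = 3
max_wait_time = 120
time_format = ISO
disabled_modules =
```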
@@ -263,6 +305,22 @@ The option `time_format` will specify the format of the timestamp that replaces
The format can be specified through the [format codes](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior) that are standard in the Python `datetime` library.
#### Disabling Modules
The individual modules of the BDFR, used to download submissions from websites, can be disabled. This is especially helpful in the case of the fallback downloaders, since the `--skip-domain` option cannot be used effectively against them. For example, the Youtube-DL downloader can retrieve data from hundreds of websites and domains; the only way to fully disable it is via the `--disable-module` option.
Modules can be disabled through the command-line interface for the BDFR or, more permanently, in the configuration file via the `disabled_modules` option. The downloaders that can be disabled are listed below, with a short example after the list. Note that the names are case-insensitive.
- `Direct`
- `Erome`
- `Gallery` (Reddit Image Galleries)
- `Gfycat`
- `Imgur`
- `Redgifs`
- `SelfPost` (Reddit Text Post)
- `Youtube`
- `YoutubeDlFallback`
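As a sketch, disabling the Youtube-DL fallback for a single run might look like the following, where the directory and subreddit are placeholders:
```bash
python3 -m bdfr download ./downloads --subreddit pics --disable-module YoutubeDlFallback
```
In the configuration file, the equivalent is a comma- or semicolon-separated list, e.g. `disabled_modules = Youtube,YoutubeDlFallback`, since the BDFR splits these values on commas and semicolons.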
### Rate Limiting
The option `max_wait_time` has to do with retrying downloads. There are certain HTTP errors that mean that no amount of requests will return the wanted data, but some errors are from rate-limiting. This is when a single client is making so many requests that the remote website cuts the client off to preserve the function of the site. This is a common situation when downloading many resources from the same site. It is polite and best practice to obey the website's wishes in these cases.
@@ -277,10 +335,14 @@ The BDFR can be run in multiple instances with multiple configurations, either c
Running these scenarios consecutively is done easily, like any single run. Configuration files that differ may be specified with the `--config` option to switch between tokens, for example. Otherwise, almost all configuration for data sources can be specified per-run through the command line.
-Running scenarious concurrently (at the same time) however, is more complicated. The BDFR will look to a single, static place to put the detailed log files, in a directory with the configuration file specified above. If there are multiple instances, or processes, of the BDFR running at the same time, they will all be trying to write to a single file. On Linux and other UNIX based operating systems, this will succeed, though there is a substantial risk that the logfile will be useless due to garbled and jumbled data. On Windows however, attempting this will raise an error that crashes the program as Windows forbids multiple processes from accessing the same file.
+Running scenarios concurrently (at the same time) however, is more complicated. The BDFR will look to a single, static place to put the detailed log files, in a directory with the configuration file specified above. If there are multiple instances, or processes, of the BDFR running at the same time, they will all be trying to write to a single file. On Linux and other UNIX based operating systems, this will succeed, though there is a substantial risk that the logfile will be useless due to garbled and jumbled data. On Windows however, attempting this will raise an error that crashes the program as Windows forbids multiple processes from accessing the same file.
The way to fix this is to use the `--log` option to manually specify where the logfile is to be stored. If the given location is unique to each instance of the BDFR, then it will run fine.
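For example, two concurrent instances might be started as follows, where the directories, subreddits, and logfile paths are placeholders:
```bash
python3 -m bdfr download ./run-a --subreddit EarthPorn --log ./run-a.log &
python3 -m bdfr download ./run-b --subreddit CityPorn --log ./run-b.log &
```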
## Manipulating Logfiles
The logfiles that the BDFR outputs are consistent and quite detailed and in a format that is amenable to regex. To this end, a number of bash scripts have been [included here](./scripts). They show examples for how to extract successfully downloaded IDs, failed IDs, and more besides.
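As a sketch of the idea, the IDs of successfully downloaded submissions can be pulled from a logfile with standard tools, assuming the default log format shown elsewhere in this commit (`[%(asctime)s - %(name)s - %(levelname)s] - %(message)s`):
```bash
grep 'Downloaded submission' log_output.txt | awk '{ print $(NF-2) }'
```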
## List of currently supported sources
- Direct links (links leading to a file)


@@ -6,6 +6,7 @@ import sys
import click
from bdfr.archiver import Archiver
from bdfr.cloner import RedditCloner
from bdfr.configuration import Configuration
from bdfr.downloader import RedditDownloader
@@ -13,30 +14,55 @@ logger = logging.getLogger()
_common_options = [
click.argument('directory', type=str),
-click.option('--config', type=str, default=None),
-click.option('-v', '--verbose', default=None, count=True),
-click.option('-l', '--link', multiple=True, default=None, type=str),
-click.option('-s', '--subreddit', multiple=True, default=None, type=str),
-click.option('-m', '--multireddit', multiple=True, default=None, type=str),
-click.option('-L', '--limit', default=None, type=int),
click.option('--authenticate', is_flag=True, default=None),
+click.option('--config', type=str, default=None),
+click.option('--disable-module', multiple=True, default=None, type=str),
+click.option('--ignore-user', type=str, multiple=True, default=None),
+click.option('--include-id-file', multiple=True, default=None),
click.option('--log', type=str, default=None),
-click.option('--submitted', is_flag=True, default=None),
-click.option('--upvoted', is_flag=True, default=None),
click.option('--saved', is_flag=True, default=None),
click.option('--search', default=None, type=str),
+click.option('--submitted', is_flag=True, default=None),
click.option('--time-format', type=str, default=None),
-click.option('-u', '--user', type=str, default=None),
+click.option('--upvoted', is_flag=True, default=None),
+click.option('-L', '--limit', default=None, type=int),
+click.option('-l', '--link', multiple=True, default=None, type=str),
+click.option('-m', '--multireddit', multiple=True, default=None, type=str),
+click.option('-S', '--sort', type=click.Choice(('hot', 'top', 'new', 'controversial', 'rising', 'relevance')),
+default=None),
+click.option('-s', '--subreddit', multiple=True, default=None, type=str),
click.option('-t', '--time', type=click.Choice(('all', 'hour', 'day', 'week', 'month', 'year')), default=None),
-click.option('-S', '--sort', type=click.Choice(('hot', 'top', 'new',
-'controversial', 'rising', 'relevance')), default=None),
+click.option('-u', '--user', type=str, multiple=True, default=None),
+click.option('-v', '--verbose', default=None, count=True),
]
_downloader_options = [
click.option('--file-scheme', default=None, type=str),
click.option('--folder-scheme', default=None, type=str),
click.option('--make-hard-links', is_flag=True, default=None),
click.option('--max-wait-time', type=int, default=None),
click.option('--no-dupes', is_flag=True, default=None),
click.option('--search-existing', is_flag=True, default=None),
click.option('--exclude-id', default=None, multiple=True),
click.option('--exclude-id-file', default=None, multiple=True),
click.option('--skip', default=None, multiple=True),
click.option('--skip-domain', default=None, multiple=True),
click.option('--skip-subreddit', default=None, multiple=True),
]
_archiver_options = [
click.option('--all-comments', is_flag=True, default=None),
click.option('--comment-context', is_flag=True, default=None),
click.option('-f', '--format', type=click.Choice(('xml', 'json', 'yaml')), default=None),
]
-def _add_common_options(func):
-for opt in _common_options:
-func = opt(func)
-return func
+def _add_options(opts: list):
+def wrap(func):
+for opt in opts:
+func = opt(func)
+return func
+return wrap
@click.group()
@@ -45,18 +71,8 @@ def cli():
@cli.command('download')
-@click.option('--exclude-id', default=None, multiple=True)
-@click.option('--exclude-id-file', default=None, multiple=True)
-@click.option('--file-scheme', default=None, type=str)
-@click.option('--folder-scheme', default=None, type=str)
-@click.option('--make-hard-links', is_flag=True, default=None)
-@click.option('--max-wait-time', type=int, default=None)
-@click.option('--no-dupes', is_flag=True, default=None)
-@click.option('--search-existing', is_flag=True, default=None)
-@click.option('--skip', default=None, multiple=True)
-@click.option('--skip-domain', default=None, multiple=True)
-@click.option('--skip-subreddit', default=None, multiple=True)
-@_add_common_options
+@_add_options(_common_options)
+@_add_options(_downloader_options)
@click.pass_context
def cli_download(context: click.Context, **_):
config = Configuration()
@@ -73,9 +89,8 @@ def cli_download(context: click.Context, **_):
@cli.command('archive')
-@_add_common_options
-@click.option('--all-comments', is_flag=True, default=None)
-@click.option('-f', '--format', type=click.Choice(('xml', 'json', 'yaml')), default=None)
+@_add_options(_common_options)
+@_add_options(_archiver_options)
@click.pass_context
def cli_archive(context: click.Context, **_):
config = Configuration()
@@ -85,7 +100,26 @@ def cli_archive(context: click.Context, **_):
reddit_archiver = Archiver(config)
reddit_archiver.download()
except Exception:
-logger.exception('Downloader exited unexpectedly')
+logger.exception('Archiver exited unexpectedly')
raise
else:
logger.info('Program complete')
@cli.command('clone')
@_add_options(_common_options)
@_add_options(_archiver_options)
@_add_options(_downloader_options)
@click.pass_context
def cli_clone(context: click.Context, **_):
config = Configuration()
config.process_click_arguments(context)
setup_logging(config.verbose)
try:
reddit_scraper = RedditCloner(config)
reddit_scraper.download()
except Exception:
logger.exception('Scraper exited unexpectedly')
raise
else:
logger.info('Program complete')


@@ -22,10 +22,12 @@ class BaseArchiveEntry(ABC):
'id': in_comment.id,
'score': in_comment.score,
'subreddit': in_comment.subreddit.display_name,
'author_flair': in_comment.author_flair_text,
'submission': in_comment.submission.id,
'stickied': in_comment.stickied,
'body': in_comment.body,
'is_submitter': in_comment.is_submitter,
'distinguished': in_comment.distinguished,
'created_utc': in_comment.created_utc,
'parent_id': in_comment.parent_id,
'replies': [],


@@ -35,6 +35,10 @@ class SubmissionArchiveEntry(BaseArchiveEntry):
'link_flair_text': self.source.link_flair_text,
'num_comments': self.source.num_comments,
'over_18': self.source.over_18,
'spoiler': self.source.spoiler,
'pinned': self.source.pinned,
'locked': self.source.locked,
'distinguished': self.source.distinguished,
'created_utc': self.source.created_utc,
}


@@ -14,24 +14,30 @@ from bdfr.archive_entry.base_archive_entry import BaseArchiveEntry
from bdfr.archive_entry.comment_archive_entry import CommentArchiveEntry
from bdfr.archive_entry.submission_archive_entry import SubmissionArchiveEntry
from bdfr.configuration import Configuration
-from bdfr.downloader import RedditDownloader
+from bdfr.connector import RedditConnector
from bdfr.exceptions import ArchiverError
from bdfr.resource import Resource
logger = logging.getLogger(__name__)
-class Archiver(RedditDownloader):
+class Archiver(RedditConnector):
def __init__(self, args: Configuration):
super(Archiver, self).__init__(args)
def download(self):
for generator in self.reddit_lists:
for submission in generator:
if (submission.author and submission.author.name in self.args.ignore_user) or \
(submission.author is None and 'DELETED' in self.args.ignore_user):
logger.debug(
f'Submission {submission.id} in {submission.subreddit.display_name} skipped'
f' due to {submission.author.name if submission.author else "DELETED"} being an ignored user')
continue
logger.debug(f'Attempting to archive submission {submission.id}')
-self._write_entry(submission)
+self.write_entry(submission)
-def _get_submissions_from_link(self) -> list[list[praw.models.Submission]]:
+def get_submissions_from_link(self) -> list[list[praw.models.Submission]]:
supplied_submissions = []
for sub_id in self.args.link:
if len(sub_id) == 6:
@@ -42,12 +48,13 @@ class Archiver(RedditDownloader):
supplied_submissions.append(self.reddit_instance.submission(url=sub_id))
return [supplied_submissions]
-def _get_user_data(self) -> list[Iterator]:
-results = super(Archiver, self)._get_user_data()
+def get_user_data(self) -> list[Iterator]:
+results = super(Archiver, self).get_user_data()
if self.args.user and self.args.all_comments:
-sort = self._determine_sort_function()
-logger.debug(f'Retrieving comments of user {self.args.user}')
-results.append(sort(self.reddit_instance.redditor(self.args.user).comments, limit=self.args.limit))
+sort = self.determine_sort_function()
+for user in self.args.user:
+logger.debug(f'Retrieving comments of user {user}')
+results.append(sort(self.reddit_instance.redditor(user).comments, limit=self.args.limit))
return results
@staticmethod
@@ -59,7 +66,10 @@ class Archiver(RedditDownloader):
else:
raise ArchiverError(f'Factory failed to classify item of type {type(praw_item).__name__}')
-def _write_entry(self, praw_item: (praw.models.Submission, praw.models.Comment)):
+def write_entry(self, praw_item: (praw.models.Submission, praw.models.Comment)):
+if self.args.comment_context and isinstance(praw_item, praw.models.Comment):
+logger.debug(f'Converting comment {praw_item.id} to submission {praw_item.submission.id}')
+praw_item = praw_item.submission
archive_entry = self._pull_lever_entry_factory(praw_item)
if self.args.format == 'json':
self._write_entry_json(archive_entry)
@@ -72,17 +82,17 @@ class Archiver(RedditDownloader):
logger.info(f'Record for entry item {praw_item.id} written to disk')
def _write_entry_json(self, entry: BaseArchiveEntry):
-resource = Resource(entry.source, '', '.json')
+resource = Resource(entry.source, '', lambda: None, '.json')
content = json.dumps(entry.compile())
self._write_content_to_disk(resource, content)
def _write_entry_xml(self, entry: BaseArchiveEntry):
-resource = Resource(entry.source, '', '.xml')
+resource = Resource(entry.source, '', lambda: None, '.xml')
content = dict2xml.dict2xml(entry.compile(), wrap='root')
self._write_content_to_disk(resource, content)
def _write_entry_yaml(self, entry: BaseArchiveEntry):
-resource = Resource(entry.source, '', '.yaml')
+resource = Resource(entry.source, '', lambda: None, '.yaml')
content = yaml.dump(entry.compile())
self._write_content_to_disk(resource, content)

bdfr/cloner.py Normal file

@@ -0,0 +1,21 @@
#!/usr/bin/env python3
# coding=utf-8
import logging
from bdfr.archiver import Archiver
from bdfr.configuration import Configuration
from bdfr.downloader import RedditDownloader
logger = logging.getLogger(__name__)
class RedditCloner(RedditDownloader, Archiver):
def __init__(self, args: Configuration):
super(RedditCloner, self).__init__(args)
def download(self):
for generator in self.reddit_lists:
for submission in generator:
self._download_submission(submission)
self.write_entry(submission)
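A note on the class above: `RedditCloner` relies on Python's method resolution order to combine its two parents, picking up `_download_submission` from `RedditDownloader` and `write_entry` from `Archiver`, both of which ultimately share `RedditConnector`. A minimal, self-contained sketch of the same diamond pattern (none of these names are BDFR code):
```python
class Connector:
    def __init__(self):
        self.items = ['abc123', 'def456']

class Downloader(Connector):
    def download_item(self, item: str):
        print(f'downloading {item}')

class Archiver(Connector):
    def write_entry(self, item: str):
        print(f'archiving {item}')

class Cloner(Downloader, Archiver):
    def download(self):
        for item in self.items:
            self.download_item(item)  # inherited from Downloader
            self.write_entry(item)    # inherited from Archiver

Cloner().download()
```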


@@ -13,19 +13,23 @@ class Configuration(Namespace):
self.authenticate = False
self.config = None
self.directory: str = '.'
+self.disable_module: list[str] = []
self.exclude_id = []
self.exclude_id_file = []
+self.file_scheme: str = '{REDDITOR}_{TITLE}_{POSTID}'
+self.folder_scheme: str = '{SUBREDDIT}'
+self.ignore_user = []
+self.include_id_file = []
self.limit: Optional[int] = None
self.link: list[str] = []
self.log: Optional[str] = None
+self.make_hard_links = False
self.max_wait_time = None
self.multireddit: list[str] = []
self.no_dupes: bool = False
self.saved: bool = False
self.search: Optional[str] = None
self.search_existing: bool = False
-self.file_scheme: str = '{REDDITOR}_{TITLE}_{POSTID}'
-self.folder_scheme: str = '{SUBREDDIT}'
self.skip: list[str] = []
self.skip_domain: list[str] = []
self.skip_subreddit: list[str] = []
@@ -35,13 +39,13 @@ class Configuration(Namespace):
self.time: str = 'all'
self.time_format = None
self.upvoted: bool = False
-self.user: Optional[str] = None
+self.user: list[str] = []
self.verbose: int = 0
-self.make_hard_links = False
# Archiver-specific options
-self.format = 'json'
self.all_comments = False
+self.format = 'json'
+self.comment_context: bool = False
def process_click_arguments(self, context: click.Context):
for arg_key in context.params.keys():

bdfr/connector.py Normal file

@@ -0,0 +1,424 @@
#!/usr/bin/env python3
# coding=utf-8
import configparser
import importlib.resources
import itertools
import logging
import logging.handlers
import re
import shutil
import socket
from abc import ABCMeta, abstractmethod
from datetime import datetime
from enum import Enum, auto
from pathlib import Path
from typing import Callable, Iterator
import appdirs
import praw
import praw.exceptions
import praw.models
import prawcore
from bdfr import exceptions as errors
from bdfr.configuration import Configuration
from bdfr.download_filter import DownloadFilter
from bdfr.file_name_formatter import FileNameFormatter
from bdfr.oauth2 import OAuth2Authenticator, OAuth2TokenManager
from bdfr.site_authenticator import SiteAuthenticator
logger = logging.getLogger(__name__)
class RedditTypes:
class SortType(Enum):
CONTROVERSIAL = auto()
HOT = auto()
NEW = auto()
RELEVANCE = auto()
RISING = auto()
TOP = auto()
class TimeType(Enum):
ALL = 'all'
DAY = 'day'
HOUR = 'hour'
MONTH = 'month'
WEEK = 'week'
YEAR = 'year'
class RedditConnector(metaclass=ABCMeta):
def __init__(self, args: Configuration):
self.args = args
self.config_directories = appdirs.AppDirs('bdfr', 'BDFR')
self.run_time = datetime.now().isoformat()
self._setup_internal_objects()
self.reddit_lists = self.retrieve_reddit_lists()
def _setup_internal_objects(self):
self.determine_directories()
self.load_config()
self.create_file_logger()
self.read_config()
self.parse_disabled_modules()
self.download_filter = self.create_download_filter()
logger.log(9, 'Created download filter')
self.time_filter = self.create_time_filter()
logger.log(9, 'Created time filter')
self.sort_filter = self.create_sort_filter()
logger.log(9, 'Created sort filter')
self.file_name_formatter = self.create_file_name_formatter()
logger.log(9, 'Create file name formatter')
self.create_reddit_instance()
self.args.user = list(filter(None, [self.resolve_user_name(user) for user in self.args.user]))
self.excluded_submission_ids = set.union(
self.read_id_files(self.args.exclude_id_file),
set(self.args.exclude_id),
)
self.args.link = list(itertools.chain(self.args.link, self.read_id_files(self.args.include_id_file)))
self.master_hash_list = {}
self.authenticator = self.create_authenticator()
logger.log(9, 'Created site authenticator')
self.args.skip_subreddit = self.split_args_input(self.args.skip_subreddit)
self.args.skip_subreddit = set([sub.lower() for sub in self.args.skip_subreddit])
def read_config(self):
"""Read any cfg values that need to be processed"""
if self.args.max_wait_time is None:
self.args.max_wait_time = self.cfg_parser.getint('DEFAULT', 'max_wait_time', fallback=120)
logger.debug(f'Setting maximum download wait time to {self.args.max_wait_time} seconds')
if self.args.time_format is None:
option = self.cfg_parser.get('DEFAULT', 'time_format', fallback='ISO')
if re.match(r'^[\s\'\"]*$', option):
option = 'ISO'
logger.debug(f'Setting datetime format string to {option}')
self.args.time_format = option
if not self.args.disable_module:
self.args.disable_module = [self.cfg_parser.get('DEFAULT', 'disabled_modules', fallback='')]
# Update config on disk
with open(self.config_location, 'w') as file:
self.cfg_parser.write(file)
def parse_disabled_modules(self):
disabled_modules = self.args.disable_module
disabled_modules = self.split_args_input(disabled_modules)
disabled_modules = set([name.strip().lower() for name in disabled_modules])
self.args.disable_module = disabled_modules
logger.debug(f'Disabling the following modules: {", ".join(self.args.disable_module)}')
def create_reddit_instance(self):
if self.args.authenticate:
logger.debug('Using authenticated Reddit instance')
if not self.cfg_parser.has_option('DEFAULT', 'user_token'):
logger.log(9, 'Commencing OAuth2 authentication')
scopes = self.cfg_parser.get('DEFAULT', 'scopes', fallback='identity, history, read, save')
scopes = OAuth2Authenticator.split_scopes(scopes)
oauth2_authenticator = OAuth2Authenticator(
scopes,
self.cfg_parser.get('DEFAULT', 'client_id'),
self.cfg_parser.get('DEFAULT', 'client_secret'),
)
token = oauth2_authenticator.retrieve_new_token()
self.cfg_parser['DEFAULT']['user_token'] = token
with open(self.config_location, 'w') as file:
self.cfg_parser.write(file, True)
token_manager = OAuth2TokenManager(self.cfg_parser, self.config_location)
self.authenticated = True
self.reddit_instance = praw.Reddit(
client_id=self.cfg_parser.get('DEFAULT', 'client_id'),
client_secret=self.cfg_parser.get('DEFAULT', 'client_secret'),
user_agent=socket.gethostname(),
token_manager=token_manager,
)
else:
logger.debug('Using unauthenticated Reddit instance')
self.authenticated = False
self.reddit_instance = praw.Reddit(
client_id=self.cfg_parser.get('DEFAULT', 'client_id'),
client_secret=self.cfg_parser.get('DEFAULT', 'client_secret'),
user_agent=socket.gethostname(),
)
def retrieve_reddit_lists(self) -> list[praw.models.ListingGenerator]:
master_list = []
master_list.extend(self.get_subreddits())
logger.log(9, 'Retrieved subreddits')
master_list.extend(self.get_multireddits())
logger.log(9, 'Retrieved multireddits')
master_list.extend(self.get_user_data())
logger.log(9, 'Retrieved user data')
master_list.extend(self.get_submissions_from_link())
logger.log(9, 'Retrieved submissions for given links')
return master_list
def determine_directories(self):
self.download_directory = Path(self.args.directory).resolve().expanduser()
self.config_directory = Path(self.config_directories.user_config_dir)
self.download_directory.mkdir(exist_ok=True, parents=True)
self.config_directory.mkdir(exist_ok=True, parents=True)
def load_config(self):
self.cfg_parser = configparser.ConfigParser()
if self.args.config:
if (cfg_path := Path(self.args.config)).exists():
self.cfg_parser.read(cfg_path)
self.config_location = cfg_path
return
possible_paths = [
Path('./config.cfg'),
Path('./default_config.cfg'),
Path(self.config_directory, 'config.cfg'),
Path(self.config_directory, 'default_config.cfg'),
]
self.config_location = None
for path in possible_paths:
if path.resolve().expanduser().exists():
self.config_location = path
logger.debug(f'Loading configuration from {path}')
break
if not self.config_location:
with importlib.resources.path('bdfr', 'default_config.cfg') as path:
self.config_location = path
shutil.copy(self.config_location, Path(self.config_directory, 'default_config.cfg'))
if not self.config_location:
raise errors.BulkDownloaderException('Could not find a configuration file to load')
self.cfg_parser.read(self.config_location)
def create_file_logger(self):
main_logger = logging.getLogger()
if self.args.log is None:
log_path = Path(self.config_directory, 'log_output.txt')
else:
log_path = Path(self.args.log).resolve().expanduser()
if not log_path.parent.exists():
raise errors.BulkDownloaderException(f'Designated location for logfile does not exist')
backup_count = self.cfg_parser.getint('DEFAULT', 'backup_log_count', fallback=3)
file_handler = logging.handlers.RotatingFileHandler(
log_path,
mode='a',
backupCount=backup_count,
)
if log_path.exists():
try:
file_handler.doRollover()
except PermissionError:
logger.critical(
'Cannot rollover logfile, make sure this is the only '
'BDFR process or specify alternate logfile location')
raise
formatter = logging.Formatter('[%(asctime)s - %(name)s - %(levelname)s] - %(message)s')
file_handler.setFormatter(formatter)
file_handler.setLevel(0)
main_logger.addHandler(file_handler)
@staticmethod
def sanitise_subreddit_name(subreddit: str) -> str:
pattern = re.compile(r'^(?:https://www\.reddit\.com/)?(?:r/)?(.*?)/?$')
match = re.match(pattern, subreddit)
if not match:
raise errors.BulkDownloaderException(f'Could not find subreddit name in string {subreddit}')
return match.group(1)
@staticmethod
def split_args_input(entries: list[str]) -> set[str]:
all_entries = []
split_pattern = re.compile(r'[,;]\s?')
for entry in entries:
results = re.split(split_pattern, entry)
all_entries.extend([RedditConnector.sanitise_subreddit_name(name) for name in results])
return set(all_entries)
def get_subreddits(self) -> list[praw.models.ListingGenerator]:
if self.args.subreddit:
out = []
for reddit in self.split_args_input(self.args.subreddit):
if reddit == 'friends' and self.authenticated is False:
logger.error('Cannot read friends subreddit without an authenticated instance')
continue
try:
reddit = self.reddit_instance.subreddit(reddit)
try:
self.check_subreddit_status(reddit)
except errors.BulkDownloaderException as e:
logger.error(e)
continue
if self.args.search:
out.append(reddit.search(
self.args.search,
sort=self.sort_filter.name.lower(),
limit=self.args.limit,
time_filter=self.time_filter.value,
))
logger.debug(
f'Added submissions from subreddit {reddit} with the search term "{self.args.search}"')
else:
out.append(self.create_filtered_listing_generator(reddit))
logger.debug(f'Added submissions from subreddit {reddit}')
except (errors.BulkDownloaderException, praw.exceptions.PRAWException) as e:
logger.error(f'Failed to get submissions for subreddit {reddit}: {e}')
return out
else:
return []
def resolve_user_name(self, in_name: str) -> str:
if in_name == 'me':
if self.authenticated:
resolved_name = self.reddit_instance.user.me().name
logger.log(9, f'Resolved user to {resolved_name}')
return resolved_name
else:
logger.warning('To use "me" as a user, an authenticated Reddit instance must be used')
else:
return in_name
def get_submissions_from_link(self) -> list[list[praw.models.Submission]]:
supplied_submissions = []
for sub_id in self.args.link:
if len(sub_id) == 6:
supplied_submissions.append(self.reddit_instance.submission(id=sub_id))
else:
supplied_submissions.append(self.reddit_instance.submission(url=sub_id))
return [supplied_submissions]
def determine_sort_function(self) -> Callable:
if self.sort_filter is RedditTypes.SortType.NEW:
sort_function = praw.models.Subreddit.new
elif self.sort_filter is RedditTypes.SortType.RISING:
sort_function = praw.models.Subreddit.rising
elif self.sort_filter is RedditTypes.SortType.CONTROVERSIAL:
sort_function = praw.models.Subreddit.controversial
elif self.sort_filter is RedditTypes.SortType.TOP:
sort_function = praw.models.Subreddit.top
else:
sort_function = praw.models.Subreddit.hot
return sort_function
def get_multireddits(self) -> list[Iterator]:
if self.args.multireddit:
if len(self.args.user) != 1:
logger.error(f'Only 1 user can be supplied when retrieving from multireddits')
return []
out = []
for multi in self.split_args_input(self.args.multireddit):
try:
multi = self.reddit_instance.multireddit(self.args.user[0], multi)
if not multi.subreddits:
raise errors.BulkDownloaderException
out.append(self.create_filtered_listing_generator(multi))
logger.debug(f'Added submissions from multireddit {multi}')
except (errors.BulkDownloaderException, praw.exceptions.PRAWException, prawcore.PrawcoreException) as e:
logger.error(f'Failed to get submissions for multireddit {multi}: {e}')
return out
else:
return []
def create_filtered_listing_generator(self, reddit_source) -> Iterator:
sort_function = self.determine_sort_function()
if self.sort_filter in (RedditTypes.SortType.TOP, RedditTypes.SortType.CONTROVERSIAL):
return sort_function(reddit_source, limit=self.args.limit, time_filter=self.time_filter.value)
else:
return sort_function(reddit_source, limit=self.args.limit)
def get_user_data(self) -> list[Iterator]:
if any([self.args.submitted, self.args.upvoted, self.args.saved]):
if not self.args.user:
logger.warning('At least one user must be supplied to download user data')
return []
generators = []
for user in self.args.user:
try:
self.check_user_existence(user)
except errors.BulkDownloaderException as e:
logger.error(e)
continue
if self.args.submitted:
logger.debug(f'Retrieving submitted posts of user {user}')
generators.append(self.create_filtered_listing_generator(
self.reddit_instance.redditor(user).submissions,
))
if not self.authenticated and any((self.args.upvoted, self.args.saved)):
logger.warning('Accessing user lists requires authentication')
else:
if self.args.upvoted:
logger.debug(f'Retrieving upvoted posts of user {user}')
generators.append(self.reddit_instance.redditor(user).upvoted(limit=self.args.limit))
if self.args.saved:
logger.debug(f'Retrieving saved posts of user {user}')
generators.append(self.reddit_instance.redditor(user).saved(limit=self.args.limit))
return generators
else:
return []
def check_user_existence(self, name: str):
user = self.reddit_instance.redditor(name=name)
try:
if user.id:
return
except prawcore.exceptions.NotFound:
raise errors.BulkDownloaderException(f'Could not find user {name}')
except AttributeError:
if hasattr(user, 'is_suspended'):
raise errors.BulkDownloaderException(f'User {name} is banned')
def create_file_name_formatter(self) -> FileNameFormatter:
return FileNameFormatter(self.args.file_scheme, self.args.folder_scheme, self.args.time_format)
def create_time_filter(self) -> RedditTypes.TimeType:
try:
return RedditTypes.TimeType[self.args.time.upper()]
except (KeyError, AttributeError):
return RedditTypes.TimeType.ALL
def create_sort_filter(self) -> RedditTypes.SortType:
try:
return RedditTypes.SortType[self.args.sort.upper()]
except (KeyError, AttributeError):
return RedditTypes.SortType.HOT
def create_download_filter(self) -> DownloadFilter:
return DownloadFilter(self.args.skip, self.args.skip_domain)
def create_authenticator(self) -> SiteAuthenticator:
return SiteAuthenticator(self.cfg_parser)
@abstractmethod
def download(self):
pass
@staticmethod
def check_subreddit_status(subreddit: praw.models.Subreddit):
if subreddit.display_name in ('all', 'friends'):
return
try:
assert subreddit.id
except prawcore.NotFound:
raise errors.BulkDownloaderException(f'Source {subreddit.display_name} does not exist or cannot be found')
except prawcore.Forbidden:
raise errors.BulkDownloaderException(f'Source {subreddit.display_name} is private and cannot be scraped')
@staticmethod
def read_id_files(file_locations: list[str]) -> set[str]:
out = []
for id_file in file_locations:
id_file = Path(id_file).resolve().expanduser()
if not id_file.exists():
logger.warning(f'ID file at {id_file} does not exist')
continue
with open(id_file, 'r') as file:
for line in file:
out.append(line.strip())
return set(out)
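As a usage sketch for the helpers above (assuming the package is installed so that `bdfr.connector` is importable), the name sanitisation and input splitting behave as follows:
```python
from bdfr.connector import RedditConnector

# Strips an optional URL prefix and trailing slash, leaving the bare name
print(RedditConnector.sanitise_subreddit_name('https://www.reddit.com/r/Python/'))  # Python

# Splits comma- and semicolon-separated entries into a set of clean names
print(sorted(RedditConnector.split_args_input(['aaa,bbb;ccc', 'ddd'])))  # ['aaa', 'bbb', 'ccc', 'ddd']
```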


@@ -1,405 +1,70 @@
#!/usr/bin/env python3
# coding=utf-8
import configparser
import hashlib
import importlib.resources
import logging
import logging.handlers
import os
import re
import shutil
import socket
import time
from datetime import datetime
from enum import Enum, auto
from multiprocessing import Pool
from pathlib import Path
from typing import Callable, Iterator
import appdirs
import praw
import praw.exceptions
import praw.models
import prawcore
-import bdfr.exceptions as errors
+from bdfr import exceptions as errors
from bdfr.configuration import Configuration
from bdfr.download_filter import DownloadFilter
from bdfr.file_name_formatter import FileNameFormatter
from bdfr.oauth2 import OAuth2Authenticator, OAuth2TokenManager
from bdfr.site_authenticator import SiteAuthenticator
from bdfr.connector import RedditConnector
from bdfr.site_downloaders.download_factory import DownloadFactory
logger = logging.getLogger(__name__)
def _calc_hash(existing_file: Path):
+chunk_size = 1024 * 1024
+md5_hash = hashlib.md5()
with open(existing_file, 'rb') as file:
-file_hash = hashlib.md5(file.read()).hexdigest()
-return existing_file, file_hash
+chunk = file.read(chunk_size)
+while chunk:
+md5_hash.update(chunk)
+chunk = file.read(chunk_size)
+file_hash = md5_hash.hexdigest()
+return existing_file, file_hash
class RedditTypes:
class SortType(Enum):
CONTROVERSIAL = auto()
HOT = auto()
NEW = auto()
RELEVENCE = auto()
RISING = auto()
TOP = auto()
class TimeType(Enum):
ALL = 'all'
DAY = 'day'
HOUR = 'hour'
MONTH = 'month'
WEEK = 'week'
YEAR = 'year'
-class RedditDownloader:
+class RedditDownloader(RedditConnector):
def __init__(self, args: Configuration):
self.args = args
self.config_directories = appdirs.AppDirs('bdfr', 'BDFR')
self.run_time = datetime.now().isoformat()
self._setup_internal_objects()
self.reddit_lists = self._retrieve_reddit_lists()
def _setup_internal_objects(self):
self._determine_directories()
self._load_config()
self._create_file_logger()
self._read_config()
self.download_filter = self._create_download_filter()
logger.log(9, 'Created download filter')
self.time_filter = self._create_time_filter()
logger.log(9, 'Created time filter')
self.sort_filter = self._create_sort_filter()
logger.log(9, 'Created sort filter')
self.file_name_formatter = self._create_file_name_formatter()
logger.log(9, 'Create file name formatter')
self._create_reddit_instance()
self._resolve_user_name()
self.excluded_submission_ids = self._read_excluded_ids()
super(RedditDownloader, self).__init__(args)
if self.args.search_existing:
self.master_hash_list = self.scan_existing_files(self.download_directory)
else:
self.master_hash_list = {}
self.authenticator = self._create_authenticator()
logger.log(9, 'Created site authenticator')
self.args.skip_subreddit = self._split_args_input(self.args.skip_subreddit)
self.args.skip_subreddit = set([sub.lower() for sub in self.args.skip_subreddit])
def _read_config(self):
"""Read any cfg values that need to be processed"""
if self.args.max_wait_time is None:
if not self.cfg_parser.has_option('DEFAULT', 'max_wait_time'):
self.cfg_parser.set('DEFAULT', 'max_wait_time', '120')
logger.log(9, 'Wrote default download wait time download to config file')
self.args.max_wait_time = self.cfg_parser.getint('DEFAULT', 'max_wait_time')
logger.debug(f'Setting maximum download wait time to {self.args.max_wait_time} seconds')
if self.args.time_format is None:
option = self.cfg_parser.get('DEFAULT', 'time_format', fallback='ISO')
if re.match(r'^[ \'\"]*$', option):
option = 'ISO'
logger.debug(f'Setting datetime format string to {option}')
self.args.time_format = option
# Update config on disk
with open(self.config_location, 'w') as file:
self.cfg_parser.write(file)
def _create_reddit_instance(self):
if self.args.authenticate:
logger.debug('Using authenticated Reddit instance')
if not self.cfg_parser.has_option('DEFAULT', 'user_token'):
logger.log(9, 'Commencing OAuth2 authentication')
scopes = self.cfg_parser.get('DEFAULT', 'scopes')
scopes = OAuth2Authenticator.split_scopes(scopes)
oauth2_authenticator = OAuth2Authenticator(
scopes,
self.cfg_parser.get('DEFAULT', 'client_id'),
self.cfg_parser.get('DEFAULT', 'client_secret'),
)
token = oauth2_authenticator.retrieve_new_token()
self.cfg_parser['DEFAULT']['user_token'] = token
with open(self.config_location, 'w') as file:
self.cfg_parser.write(file, True)
token_manager = OAuth2TokenManager(self.cfg_parser, self.config_location)
self.authenticated = True
self.reddit_instance = praw.Reddit(
client_id=self.cfg_parser.get('DEFAULT', 'client_id'),
client_secret=self.cfg_parser.get('DEFAULT', 'client_secret'),
user_agent=socket.gethostname(),
token_manager=token_manager,
)
else:
logger.debug('Using unauthenticated Reddit instance')
self.authenticated = False
self.reddit_instance = praw.Reddit(
client_id=self.cfg_parser.get('DEFAULT', 'client_id'),
client_secret=self.cfg_parser.get('DEFAULT', 'client_secret'),
user_agent=socket.gethostname(),
)
def _retrieve_reddit_lists(self) -> list[praw.models.ListingGenerator]:
master_list = []
master_list.extend(self._get_subreddits())
logger.log(9, 'Retrieved subreddits')
master_list.extend(self._get_multireddits())
logger.log(9, 'Retrieved multireddits')
master_list.extend(self._get_user_data())
logger.log(9, 'Retrieved user data')
master_list.extend(self._get_submissions_from_link())
logger.log(9, 'Retrieved submissions for given links')
return master_list
def _determine_directories(self):
self.download_directory = Path(self.args.directory).resolve().expanduser()
self.config_directory = Path(self.config_directories.user_config_dir)
self.download_directory.mkdir(exist_ok=True, parents=True)
self.config_directory.mkdir(exist_ok=True, parents=True)
def _load_config(self):
self.cfg_parser = configparser.ConfigParser()
if self.args.config:
if (cfg_path := Path(self.args.config)).exists():
self.cfg_parser.read(cfg_path)
self.config_location = cfg_path
return
possible_paths = [
Path('./config.cfg'),
Path('./default_config.cfg'),
Path(self.config_directory, 'config.cfg'),
Path(self.config_directory, 'default_config.cfg'),
]
self.config_location = None
for path in possible_paths:
if path.resolve().expanduser().exists():
self.config_location = path
logger.debug(f'Loading configuration from {path}')
break
if not self.config_location:
self.config_location = list(importlib.resources.path('bdfr', 'default_config.cfg').gen)[0]
shutil.copy(self.config_location, Path(self.config_directory, 'default_config.cfg'))
if not self.config_location:
raise errors.BulkDownloaderException('Could not find a configuration file to load')
self.cfg_parser.read(self.config_location)
def _create_file_logger(self):
main_logger = logging.getLogger()
if self.args.log is None:
log_path = Path(self.config_directory, 'log_output.txt')
else:
log_path = Path(self.args.log).resolve().expanduser()
if not log_path.parent.exists():
raise errors.BulkDownloaderException(f'Designated location for logfile does not exist')
backup_count = self.cfg_parser.getint('DEFAULT', 'backup_log_count', fallback=3)
file_handler = logging.handlers.RotatingFileHandler(
log_path,
mode='a',
backupCount=backup_count,
)
if log_path.exists():
try:
file_handler.doRollover()
except PermissionError as e:
logger.critical(
'Cannot rollover logfile, make sure this is the only '
'BDFR process or specify alternate logfile location')
raise
formatter = logging.Formatter('[%(asctime)s - %(name)s - %(levelname)s] - %(message)s')
file_handler.setFormatter(formatter)
file_handler.setLevel(0)
main_logger.addHandler(file_handler)
@staticmethod
def _sanitise_subreddit_name(subreddit: str) -> str:
pattern = re.compile(r'^(?:https://www\.reddit\.com/)?(?:r/)?(.*?)/?$')
match = re.match(pattern, subreddit)
if not match:
raise errors.BulkDownloaderException(f'Could not find subreddit name in string {subreddit}')
return match.group(1)
@staticmethod
def _split_args_input(entries: list[str]) -> set[str]:
all_entries = []
split_pattern = re.compile(r'[,;]\s?')
for entry in entries:
results = re.split(split_pattern, entry)
all_entries.extend([RedditDownloader._sanitise_subreddit_name(name) for name in results])
return set(all_entries)
def _get_subreddits(self) -> list[praw.models.ListingGenerator]:
if self.args.subreddit:
out = []
for reddit in self._split_args_input(self.args.subreddit):
try:
reddit = self.reddit_instance.subreddit(reddit)
try:
self._check_subreddit_status(reddit)
except errors.BulkDownloaderException as e:
logger.error(e)
continue
if self.args.search:
out.append(reddit.search(
self.args.search,
sort=self.sort_filter.name.lower(),
limit=self.args.limit,
time_filter=self.time_filter.value,
))
logger.debug(
f'Added submissions from subreddit {reddit} with the search term "{self.args.search}"')
else:
out.append(self._create_filtered_listing_generator(reddit))
logger.debug(f'Added submissions from subreddit {reddit}')
except (errors.BulkDownloaderException, praw.exceptions.PRAWException) as e:
logger.error(f'Failed to get submissions for subreddit {reddit}: {e}')
return out
else:
return []
def _resolve_user_name(self):
if self.args.user == 'me':
if self.authenticated:
self.args.user = self.reddit_instance.user.me().name
logger.log(9, f'Resolved user to {self.args.user}')
else:
self.args.user = None
logger.warning('To use "me" as a user, an authenticated Reddit instance must be used')
def _get_submissions_from_link(self) -> list[list[praw.models.Submission]]:
supplied_submissions = []
for sub_id in self.args.link:
if len(sub_id) == 6:
supplied_submissions.append(self.reddit_instance.submission(id=sub_id))
else:
supplied_submissions.append(self.reddit_instance.submission(url=sub_id))
return [supplied_submissions]
def _determine_sort_function(self) -> Callable:
if self.sort_filter is RedditTypes.SortType.NEW:
sort_function = praw.models.Subreddit.new
elif self.sort_filter is RedditTypes.SortType.RISING:
sort_function = praw.models.Subreddit.rising
elif self.sort_filter is RedditTypes.SortType.CONTROVERSIAL:
sort_function = praw.models.Subreddit.controversial
elif self.sort_filter is RedditTypes.SortType.TOP:
sort_function = praw.models.Subreddit.top
else:
sort_function = praw.models.Subreddit.hot
return sort_function
def _get_multireddits(self) -> list[Iterator]:
if self.args.multireddit:
out = []
for multi in self._split_args_input(self.args.multireddit):
try:
multi = self.reddit_instance.multireddit(self.args.user, multi)
if not multi.subreddits:
raise errors.BulkDownloaderException
out.append(self._create_filtered_listing_generator(multi))
logger.debug(f'Added submissions from multireddit {multi}')
except (errors.BulkDownloaderException, praw.exceptions.PRAWException, prawcore.PrawcoreException) as e:
logger.error(f'Failed to get submissions for multireddit {multi}: {e}')
return out
else:
return []
def _create_filtered_listing_generator(self, reddit_source) -> Iterator:
sort_function = self._determine_sort_function()
if self.sort_filter in (RedditTypes.SortType.TOP, RedditTypes.SortType.CONTROVERSIAL):
return sort_function(reddit_source, limit=self.args.limit, time_filter=self.time_filter.value)
else:
return sort_function(reddit_source, limit=self.args.limit)
def _get_user_data(self) -> list[Iterator]:
if any([self.args.submitted, self.args.upvoted, self.args.saved]):
if self.args.user:
try:
self._check_user_existence(self.args.user)
except errors.BulkDownloaderException as e:
logger.error(e)
return []
generators = []
if self.args.submitted:
logger.debug(f'Retrieving submitted posts of user {self.args.user}')
generators.append(self._create_filtered_listing_generator(
self.reddit_instance.redditor(self.args.user).submissions,
))
if not self.authenticated and any((self.args.upvoted, self.args.saved)):
logger.warning('Accessing user lists requires authentication')
else:
if self.args.upvoted:
logger.debug(f'Retrieving upvoted posts of user {self.args.user}')
generators.append(self.reddit_instance.redditor(self.args.user).upvoted(limit=self.args.limit))
if self.args.saved:
logger.debug(f'Retrieving saved posts of user {self.args.user}')
generators.append(self.reddit_instance.redditor(self.args.user).saved(limit=self.args.limit))
return generators
else:
logger.warning('A user must be supplied to download user data')
return []
else:
return []
def _check_user_existence(self, name: str):
user = self.reddit_instance.redditor(name=name)
try:
if user.id:
return
except prawcore.exceptions.NotFound:
raise errors.BulkDownloaderException(f'Could not find user {name}')
except AttributeError:
if hasattr(user, 'is_suspended'):
raise errors.BulkDownloaderException(f'User {name} is banned')
def _create_file_name_formatter(self) -> FileNameFormatter:
return FileNameFormatter(self.args.file_scheme, self.args.folder_scheme, self.args.time_format)
def _create_time_filter(self) -> RedditTypes.TimeType:
try:
return RedditTypes.TimeType[self.args.time.upper()]
except (KeyError, AttributeError):
return RedditTypes.TimeType.ALL
def _create_sort_filter(self) -> RedditTypes.SortType:
try:
return RedditTypes.SortType[self.args.sort.upper()]
except (KeyError, AttributeError):
return RedditTypes.SortType.HOT
def _create_download_filter(self) -> DownloadFilter:
return DownloadFilter(self.args.skip, self.args.skip_domain)
def _create_authenticator(self) -> SiteAuthenticator:
return SiteAuthenticator(self.cfg_parser)
def download(self):
for generator in self.reddit_lists:
for submission in generator:
-if submission.id in self.excluded_submission_ids:
-logger.debug(f'Object {submission.id} in exclusion list, skipping')
-continue
-elif submission.subreddit.display_name.lower() in self.args.skip_subreddit:
-logger.debug(f'Submission {submission.id} in {submission.subreddit.display_name} in skip list')
-else:
-logger.debug(f'Attempting to download submission {submission.id}')
-self._download_submission(submission)
+self._download_submission(submission)
def _download_submission(self, submission: praw.models.Submission):
-if not isinstance(submission, praw.models.Submission):
+if submission.id in self.excluded_submission_ids:
+logger.debug(f'Object {submission.id} in exclusion list, skipping')
+return
+elif submission.subreddit.display_name.lower() in self.args.skip_subreddit:
+logger.debug(f'Submission {submission.id} in {submission.subreddit.display_name} in skip list')
+return
+elif (submission.author and submission.author.name in self.args.ignore_user) or \
+(submission.author is None and 'DELETED' in self.args.ignore_user):
+logger.debug(
+f'Submission {submission.id} in {submission.subreddit.display_name} skipped'
+f' due to {submission.author.name if submission.author else "DELETED"} being an ignored user')
+return
+elif not isinstance(submission, praw.models.Submission):
logger.warning(f'{submission.id} is not a submission')
return
elif not self.download_filter.check_url(submission.url):
logger.debug(f'Submission {submission.id} filtered due to URL {submission.url}')
return
logger.debug(f'Attempting to download submission {submission.id}')
try:
downloader_class = DownloadFactory.pull_lever(submission.url)
downloader = downloader_class(submission)
@@ -407,7 +72,9 @@ class RedditDownloader:
except errors.NotADownloadableLinkError as e:
logger.error(f'Could not download submission {submission.id}: {e}')
return
if downloader_class.__name__.lower() in self.args.disable_module:
logger.debug(f'Submission {submission.id} skipped due to disabled module {downloader_class.__name__}')
return
try:
content = downloader.find_resources(self.authenticator)
except errors.SiteDownloaderError as e:
@@ -415,34 +82,43 @@ class RedditDownloader:
return
for destination, res in self.file_name_formatter.format_resource_paths(content, self.download_directory):
if destination.exists():
-logger.debug(f'File {destination} already exists, continuing')
+logger.debug(f'File {destination} from submission {submission.id} already exists, continuing')
continue
elif not self.download_filter.check_resource(res):
logger.debug(f'Download filter removed {submission.id} with URL {submission.url}')
else:
try:
res.download(self.args.max_wait_time)
except errors.BulkDownloaderException as e:
logger.error(f'Failed to download resource {res.url} in submission {submission.id} '
f'with downloader {downloader_class.__name__}: {e}')
logger.debug(f'Download filter removed {submission.id} file with URL {submission.url}')
continue
try:
res.download({'max_wait_time': self.args.max_wait_time})
except errors.BulkDownloaderException as e:
logger.error(f'Failed to download resource {res.url} in submission {submission.id} '
f'with downloader {downloader_class.__name__}: {e}')
return
resource_hash = res.hash.hexdigest()
destination.parent.mkdir(parents=True, exist_ok=True)
if resource_hash in self.master_hash_list:
if self.args.no_dupes:
logger.info(
f'Resource hash {resource_hash} from submission {submission.id} downloaded elsewhere')
return
elif self.args.make_hard_links:
self.master_hash_list[resource_hash].link_to(destination)
logger.info(
f'Hard link made linking {destination} to {self.master_hash_list[resource_hash]}'
f' in submission {submission.id}')
return
try:
with open(destination, 'wb') as file:
file.write(res.content)
logger.debug(f'Written file to {destination}')
except OSError as e:
logger.exception(e)
logger.error(f'Failed to write file in submission {submission.id} to {destination}: {e}')
return
creation_time = time.mktime(datetime.fromtimestamp(submission.created_utc).timetuple())
os.utime(destination, (creation_time, creation_time))
self.master_hash_list[resource_hash] = destination
logger.debug(f'Hash added to master list: {resource_hash}')
logger.info(f'Downloaded submission {submission.id} from {submission.subreddit.display_name}')
@staticmethod
def scan_existing_files(directory: Path) -> dict[str, Path]:
@@ -457,27 +133,3 @@ class RedditDownloader:
hash_list = {res[1]: res[0] for res in results}
return hash_list
def _read_excluded_ids(self) -> set[str]:
out = []
out.extend(self.args.exclude_id)
for id_file in self.args.exclude_id_file:
id_file = Path(id_file).resolve().expanduser()
if not id_file.exists():
logger.warning(f'ID exclusion file at {id_file} does not exist')
continue
with open(id_file, 'r') as file:
for line in file:
out.append(line.strip())
return set(out)
@staticmethod
def _check_subreddit_status(subreddit: praw.models.Subreddit):
if subreddit.display_name == 'all':
return
try:
assert subreddit.id
except prawcore.NotFound:
raise errors.BulkDownloaderException(f'Source {subreddit.display_name} does not exist or cannot be found')
except prawcore.Forbidden:
raise errors.BulkDownloaderException(f'Source {subreddit.display_name} is private and cannot be scraped')

View File

@@ -4,6 +4,7 @@ import datetime
import logging
import platform
import re
import subprocess
from pathlib import Path
from typing import Optional
@@ -104,32 +105,54 @@ class FileNameFormatter:
) -> Path:
subfolder = Path(
destination_directory,
*[self._format_name(resource.source_submission, part) for part in self.directory_format_string],
)
index = f'_{str(index)}' if index else ''
if not resource.extension:
raise BulkDownloaderException(f'Resource from {resource.url} has no extension')
file_name = str(self._format_name(resource.source_submission, self.file_format_string))
if not re.match(r'.*\.$', file_name) and not re.match(r'^\..*', resource.extension):
ending = index + '.' + resource.extension
else:
ending = index + resource.extension
try:
file_path = self.limit_file_name_length(file_name, ending, subfolder)
except TypeError:
raise BulkDownloaderException(f'Could not determine path name: {subfolder}, {index}, {resource.extension}')
return file_path
@staticmethod
def limit_file_name_length(filename: str, ending: str, root: Path) -> Path:
root = root.resolve().expanduser()
possible_id = re.search(r'((?:_\w{6})?$)', filename)
if possible_id:
ending = possible_id.group(1) + ending
filename = filename[:possible_id.start()]
max_path = FileNameFormatter.find_max_path_length()
max_file_part_length_chars = 255 - len(ending)
max_file_part_length_bytes = 255 - len(ending.encode('utf-8'))
max_path_length = max_path - len(ending) - len(str(root)) - 1
out = Path(root, filename + ending)
while any([len(filename) > max_file_part_length_chars,
len(filename.encode('utf-8')) > max_file_part_length_bytes,
len(str(out)) > max_path_length,
]):
filename = filename[:-1]
out = Path(root, filename + ending)
return out
@staticmethod
def find_max_path_length() -> int:
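# Ask the OS for its maximum path length via getconf; fall back to the Windows MAX_PATH default (260) or the common Linux default (4096) if the probe fails.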
try:
return int(subprocess.check_output(['getconf', 'PATH_MAX', '/']))
except (ValueError, subprocess.CalledProcessError, OSError):
if platform.system() == 'Windows':
return 260
else:
return 4096
def format_resource_paths(
self,

View File

@@ -5,8 +5,8 @@ import hashlib
import logging
import re
import time
import urllib.parse
from typing import Callable, Optional
import _hashlib
import requests
@@ -18,40 +18,26 @@ logger = logging.getLogger(__name__)
class Resource:
def __init__(self, source_submission: Submission, url: str, download_function: Callable, extension: str = None):
self.source_submission = source_submission
self.content: Optional[bytes] = None
self.url = url
self.hash: Optional[_hashlib.HASH] = None
self.extension = extension
self.download_function = download_function
if not self.extension:
self.extension = self._determine_extension()
@staticmethod
def retry_download(url: str) -> Callable:
return lambda global_params: Resource.http_download(url, global_params)
def download(self, download_parameters: Optional[dict] = None):
if download_parameters is None:
download_parameters = {}
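# Delegate the actual transfer to the injected download_function so each site module can supply its own transport (plain HTTP, yt-dlp, etc.).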
if not self.content:
try:
content = self.download_function(download_parameters)
except requests.exceptions.ConnectionError as e:
raise BulkDownloaderException(f'Could not download resource: {e}')
except BulkDownloaderException:
@@ -70,3 +56,30 @@ class Resource:
match = re.search(extension_pattern, stripped_url)
if match:
return match.group(1)
@staticmethod
def http_download(url: str, download_parameters: dict) -> Optional[bytes]:
headers = download_parameters.get('headers')
current_wait_time = 60
if 'max_wait_time' in download_parameters:
max_wait_time = download_parameters['max_wait_time']
else:
max_wait_time = 300
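# Retry transient failures (connection errors, HTTP 408/429), backing off in 60-second increments until the wait would exceed max_wait_time.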
while True:
try:
response = requests.get(url, headers=headers)
if re.match(r'^2\d{2}', str(response.status_code)) and response.content:
return response.content
elif response.status_code in (408, 429):
raise requests.exceptions.ConnectionError(f'Response code {response.status_code}')
else:
raise BulkDownloaderException(
f'Unrecoverable error requesting resource: HTTP Code {response.status_code}')
except (requests.exceptions.ConnectionError, requests.exceptions.ChunkedEncodingError) as e:
logger.warning(f'Error occurred downloading from {url}, waiting {current_wait_time} seconds: {e}')
time.sleep(current_wait_time)
if current_wait_time < max_wait_time:
current_wait_time += 60
else:
logger.error(f'Max wait time exceeded for resource at url {url}')
raise

View File

@@ -14,4 +14,4 @@ class Direct(BaseDownloader):
super().__init__(post)
def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]:
return [Resource(self.post, self.post.url, Resource.retry_download(self.post.url))]

View File

@@ -9,27 +9,32 @@ from bdfr.exceptions import NotADownloadableLinkError
from bdfr.site_downloaders.base_downloader import BaseDownloader
from bdfr.site_downloaders.direct import Direct
from bdfr.site_downloaders.erome import Erome
from bdfr.site_downloaders.fallback_downloaders.ytdlp_fallback import YtdlpFallback
from bdfr.site_downloaders.gallery import Gallery
from bdfr.site_downloaders.gfycat import Gfycat
from bdfr.site_downloaders.imgur import Imgur
from bdfr.site_downloaders.pornhub import PornHub
from bdfr.site_downloaders.redgifs import Redgifs
from bdfr.site_downloaders.self_post import SelfPost
from bdfr.site_downloaders.vidble import Vidble
from bdfr.site_downloaders.youtube import Youtube
class DownloadFactory:
@staticmethod
def pull_lever(url: str) -> Type[BaseDownloader]:
sanitised_url = DownloadFactory.sanitise_url(url)
if re.match(r'(i\.)?imgur.*\.gif.+$', sanitised_url):
return Imgur
elif re.match(r'.*/.*\.\w{3,4}(\?[\w;&=]*)?$', sanitised_url) and \
not DownloadFactory.is_web_resource(sanitised_url):
return Direct
elif re.match(r'erome\.com.*', sanitised_url):
return Erome
elif re.match(r'reddit\.com/gallery/.*', sanitised_url):
return Gallery
elif re.match(r'patreon\.com.*', sanitised_url):
return Gallery
elif re.match(r'gfycat\.', sanitised_url):
return Gfycat
elif re.match(r'(m\.)?imgur.*', sanitised_url):
@@ -42,16 +47,39 @@ class DownloadFactory:
return Youtube
elif re.match(r'i\.redd\.it.*', sanitised_url):
return Direct
elif re.match(r'pornhub\.com.*', sanitised_url):
return PornHub
elif re.match(r'vidble\.com', sanitised_url):
return Vidble
elif YtdlpFallback.can_handle_link(sanitised_url):
return YtdlpFallback
else:
raise NotADownloadableLinkError(f'No downloader module exists for url {url}')
@staticmethod
def sanitise_url(url: str) -> str:
beginning_regex = re.compile(r'\s*(www\.?)?')
split_url = urllib.parse.urlsplit(url)
split_url = split_url.netloc + split_url.path
split_url = re.sub(beginning_regex, '', split_url)
return split_url
@staticmethod
def is_web_resource(url: str) -> bool:
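# URLs ending in these server-rendered page extensions are HTML pages, not directly downloadable media.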
web_extensions = (
'asp',
'aspx',
'cfm',
'cfml',
'css',
'htm',
'html',
'js',
'php',
'php3',
'xhtml',
)
if re.match(rf'(?i).*/.*\.({"|".join(web_extensions)})$', url):
return True
else:
return False

View File

@@ -2,7 +2,7 @@
import logging
import re
from typing import Callable, Optional
import bs4
from praw.models import Submission
@@ -29,7 +29,7 @@ class Erome(BaseDownloader):
for link in links:
if not re.match(r'https?://.*', link):
link = 'https://' + link
out.append(Resource(self.post, link, self.erome_download(link)))
return out
@staticmethod
@@ -43,3 +43,14 @@ class Erome(BaseDownloader):
out.extend([vid.get('src') for vid in videos])
return set(out)
@staticmethod
def erome_download(url: str) -> Callable:
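# Erome rejects bare requests, so send browser-like headers and a Referer, merged over the caller's parameters (dict union requires Python 3.9+).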
download_parameters = {
'headers': {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
' Chrome/88.0.4324.104 Safari/537.36',
'Referer': 'https://www.erome.com/',
},
}
return lambda global_params: Resource.http_download(url, global_params | download_parameters)

View File

@@ -4,9 +4,9 @@
import logging
from typing import Optional
import youtube_dl
from praw.models import Submission
from bdfr.exceptions import NotADownloadableLinkError
from bdfr.resource import Resource
from bdfr.site_authenticator import SiteAuthenticator
from bdfr.site_downloaders.fallback_downloaders.fallback_downloader import BaseFallbackDownloader
@@ -15,26 +15,24 @@ from bdfr.site_downloaders.youtube import Youtube
logger = logging.getLogger(__name__)
class YtdlpFallback(BaseFallbackDownloader, Youtube):
def __init__(self, post: Submission):
super(YtdlpFallback, self).__init__(post)
def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]:
out = Resource(
self.post,
self.post.url,
super()._download_video({}),
super().get_video_attributes(self.post.url)['ext'],
)
return [out]
@staticmethod
def can_handle_link(url: str) -> bool:
try:
attributes = YtdlpFallback.get_video_attributes(url)
except NotADownloadableLinkError:
return False
if attributes:
return True
return False

View File

@@ -1,10 +1,9 @@
#!/usr/bin/env python3
import logging
import re
from typing import Optional
import requests
from praw.models import Submission
from bdfr.exceptions import SiteDownloaderError
@@ -20,21 +19,30 @@ class Gallery(BaseDownloader):
super().__init__(post)
def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]:
try:
image_urls = self._get_links(self.post.gallery_data['items'])
except (AttributeError, TypeError):
try:
image_urls = self._get_links(self.post.crosspost_parent_list[0]['gallery_data']['items'])
except (AttributeError, IndexError, TypeError, KeyError):
logger.error(f'Could not find gallery data in submission {self.post.id}')
logger.exception('Gallery image find failure')
raise SiteDownloaderError('No images found in Reddit gallery')
if not image_urls:
raise SiteDownloaderError('No images found in Reddit gallery')
return [Resource(self.post, url, Resource.retry_download(url)) for url in image_urls]
@staticmethod
def _get_links(id_dict: list[dict]) -> list[str]:
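# Reddit gallery data only carries media IDs; probe i.redd.it with HEAD requests to discover each item's actual extension.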
out = []
for item in id_dict:
image_id = item['media_id']
possible_extensions = ('.jpg', '.png', '.gif', '.gifv', '.jpeg')
for extension in possible_extensions:
test_url = f'https://i.redd.it/{image_id}{extension}'
response = requests.head(test_url)
if response.status_code == 200:
out.append(test_url)
break
return out

View File

@@ -27,6 +27,7 @@ class Gfycat(Redgifs):
response = Gfycat.retrieve_url(url)
if re.search(r'(redgifs|gifdeliverynetwork)', response.url):
url = url.lower() # Fixes error with old gfycat/redgifs links
return Redgifs._get_link(url)
soup = BeautifulSoup(response.text, 'html.parser')

View File

@@ -32,14 +32,19 @@ class Imgur(BaseDownloader):
return out
def _compute_image_url(self, image: dict) -> Resource:
ext = self._validate_extension(image['ext'])
if image.get('prefer_video', False):
ext = '.mp4'
image_url = 'https://i.imgur.com/' + image['hash'] + ext
return Resource(self.post, image_url, Resource.retry_download(image_url))
@staticmethod
def _get_data(link: str) -> dict:
link = link.rstrip('?')
if re.match(r'(?i).*\.gif.+$', link):
link = link.replace('i.imgur', 'imgur')
link = re.sub('(?i)\\.gif.+$', '', link)
res = Imgur.retrieve_url(link, cookies={'over18': '1', 'postpagebeta': '0'})
@@ -71,6 +76,7 @@ class Imgur(BaseDownloader):
@staticmethod
def _validate_extension(extension_suffix: str) -> str:
extension_suffix = extension_suffix.strip('?1')
possible_extensions = ('.jpg', '.png', '.mp4', '.gif')
selection = [ext for ext in possible_extensions if ext == extension_suffix]
if len(selection) == 1:

View File

@@ -0,0 +1,37 @@
#!/usr/bin/env python3
# coding=utf-8
import logging
from typing import Optional
from praw.models import Submission
from bdfr.exceptions import SiteDownloaderError
from bdfr.resource import Resource
from bdfr.site_authenticator import SiteAuthenticator
from bdfr.site_downloaders.youtube import Youtube
logger = logging.getLogger(__name__)
class PornHub(Youtube):
def __init__(self, post: Submission):
super().__init__(post)
def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]:
ytdl_options = {
'format': 'best',
'nooverwrites': True,
}
if video_attributes := super().get_video_attributes(self.post.url):
extension = video_attributes['ext']
else:
raise SiteDownloaderError()
out = Resource(
self.post,
self.post.url,
super()._download_video(ytdl_options),
extension,
)
return [out]

View File

@@ -4,7 +4,6 @@ import json
import re
from typing import Optional
from bs4 import BeautifulSoup
from praw.models import Submission
from bdfr.exceptions import SiteDownloaderError
@@ -19,7 +18,7 @@ class Redgifs(BaseDownloader):
def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]:
media_url = self._get_link(self.post.url)
return [Resource(self.post, media_url, Resource.retry_download(media_url), '.mp4')]
@staticmethod
def _get_link(url: str) -> str:

View File

@@ -17,7 +17,7 @@ class SelfPost(BaseDownloader):
super().__init__(post)
def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]:
out = Resource(self.post, self.post.url, lambda: None, '.txt')
out.content = self.export_to_string().encode('utf-8')
out.create_hash()
return [out]

View File

@@ -0,0 +1,54 @@
#!/usr/bin/env python3
# coding=utf-8
import itertools
import logging
import re
from typing import Optional
import bs4
import requests
from praw.models import Submission
from bdfr.exceptions import SiteDownloaderError
from bdfr.resource import Resource
from bdfr.site_authenticator import SiteAuthenticator
from bdfr.site_downloaders.base_downloader import BaseDownloader
logger = logging.getLogger(__name__)
class Vidble(BaseDownloader):
def __init__(self, post: Submission):
super().__init__(post)
def find_resources(self, authenticator: Optional[SiteAuthenticator] = None) -> list[Resource]:
try:
res = self.get_links(self.post.url)
except AttributeError:
raise SiteDownloaderError(f'Could not read page at {self.post.url}')
if not res:
raise SiteDownloaderError(rf'No resources found at {self.post.url}')
res = [Resource(self.post, r, Resource.retry_download(r)) for r in res]
return res
@staticmethod
def get_links(url: str) -> set[str]:
if not re.search(r'vidble.com/(show/|album/|watch\?v)', url):
url = re.sub(r'/(\w*?)$', r'/show/\1', url)
page = requests.get(url)
soup = bs4.BeautifulSoup(page.text, 'html.parser')
content_div = soup.find('div', attrs={'id': 'ContentPlaceHolder1_divContent'})
images = content_div.find_all('img')
images = [i.get('src') for i in images]
videos = content_div.find_all('source', attrs={'type': 'video/mp4'})
videos = [v.get('src') for v in videos]
resources = filter(None, itertools.chain(images, videos))
resources = ['https://www.vidble.com' + r for r in resources]
resources = [Vidble.change_med_url(r) for r in resources]
return set(resources)
@staticmethod
def change_med_url(url: str) -> str:
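# Vidble thumbnails carry a _med suffix before the extension; strip it to get the full-resolution file.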
out = re.sub(r'_med(\..{3,4})$', r'\1', url)
return out

View File

@@ -3,12 +3,12 @@
import logging
import tempfile
from pathlib import Path
from typing import Optional
from typing import Callable, Optional
import yt_dlp
from praw.models import Submission
from bdfr.exceptions import NotADownloadableLinkError, SiteDownloaderError
from bdfr.resource import Resource
from bdfr.site_authenticator import SiteAuthenticator
from bdfr.site_downloaders.base_downloader import BaseDownloader
@@ -26,28 +26,48 @@ class Youtube(BaseDownloader):
'playlistend': 1,
'nooverwrites': True,
}
download_function = self._download_video(ytdl_options)
extension = self.get_video_attributes(self.post.url)['ext']
res = Resource(self.post, self.post.url, download_function, extension)
return [res]
def _download_video(self, ytdl_options: dict) -> Callable:
yt_logger = logging.getLogger('youtube-dl')
yt_logger.setLevel(logging.CRITICAL)
ytdl_options['quiet'] = True
ytdl_options['logger'] = yt_logger
def download(_: dict) -> bytes:
with tempfile.TemporaryDirectory() as temp_dir:
download_path = Path(temp_dir).resolve()
ytdl_options['outtmpl'] = str(download_path) + '/' + 'test.%(ext)s'
try:
with yt_dlp.YoutubeDL(ytdl_options) as ydl:
ydl.download([self.post.url])
except yt_dlp.DownloadError as e:
raise SiteDownloaderError(f'Youtube download failed: {e}')
downloaded_files = list(download_path.iterdir())
if len(downloaded_files) > 0:
downloaded_file = downloaded_files[0]
else:
raise NotADownloadableLinkError(f"No media exists in the URL {self.post.url}")
with open(downloaded_file, 'rb') as file:
content = file.read()
return content
return download
@staticmethod
def get_video_attributes(url: str) -> dict:
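# Extract metadata only, without downloading; a result lacking 'ext' means yt-dlp could not identify downloadable media.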
yt_logger = logging.getLogger('youtube-dl')
yt_logger.setLevel(logging.CRITICAL)
with yt_dlp.YoutubeDL({'logger': yt_logger, }) as ydl:
try:
result = ydl.extract_info(url, download=False)
except Exception as e:
logger.exception(e)
raise NotADownloadableLinkError(f'Video info extraction failed for {url}')
if 'ext' in result:
return result
else:
raise NotADownloadableLinkError(f'Video info extraction failed for {url}')

View File

@@ -1,2 +1,5 @@
if (-not ([string]::IsNullOrEmpty($env:REDDIT_TOKEN)))
{
copy .\\bdfr\\default_config.cfg .\\test_config.cfg
echo "`nuser_token = $env:REDDIT_TOKEN" >> ./test_config.cfg
}

View File

@@ -1,2 +1,5 @@
if [ ! -z "$REDDIT_TOKEN" ]
then
cp ./bdfr/default_config.cfg ./test_config.cfg
echo -e "\nuser_token = $REDDIT_TOKEN" >> ./test_config.cfg
fi

View File

@@ -6,4 +6,4 @@ ffmpeg-python>=0.2.0
praw>=7.2.0
pyyaml>=5.4.1
requests>=2.25.1
yt-dlp>=2021.9.25

View File

@@ -0,0 +1,21 @@
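# Scan a BDFR log for known failure messages and collect the affected submission IDs; the Skip counts index into each space-split log line.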
if (Test-Path -Path $args[0] -PathType Leaf) {
$file=$args[0]
}
else {
Write-Host "CANNOT FIND LOG FILE"
Exit 1
}
if ($args[1] -ne $null) {
$output=$args[1]
Write-Host "Outputting IDs to $output"
}
else {
$output="./failed.txt"
}
Select-String -Path $file -Pattern "Could not download submission" | ForEach-Object { -split $_.Line | Select-Object -Skip 11 | Select-Object -First 1 } | foreach { $_.substring(0,$_.Length-1) } >> $output
Select-String -Path $file -Pattern "Failed to download resource" | ForEach-Object { -split $_.Line | Select-Object -Skip 14 | Select-Object -First 1 } >> $output
Select-String -Path $file -Pattern "failed to download submission" | ForEach-Object { -split $_.Line | Select-Object -Skip 13 | Select-Object -First 1 } | foreach { $_.substring(0,$_.Length-1) } >> $output
Select-String -Path $file -Pattern "Failed to write file" | ForEach-Object { -split $_.Line | Select-Object -Skip 13 | Select-Object -First 1 } >> $output
Select-String -Path $file -Pattern "skipped due to disabled module" | ForEach-Object { -split $_.Line | Select-Object -Skip 8 | Select-Object -First 1 } >> $output

View File

@@ -11,8 +11,13 @@ if [ -n "$2" ]; then
output="$2"
echo "Outputting IDs to $output"
else
output="failed.txt"
output="./failed.txt"
fi
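# Each grep matches one known failure message; awk selects the field holding the submission ID, and rev | cut strips a trailing colon where present.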
{
grep 'Could not download submission' "$file" | awk '{ print $12 }' | rev | cut -c 2- | rev ;
grep 'Failed to download resource' "$file" | awk '{ print $15 }' ;
grep 'failed to download submission' "$file" | awk '{ print $14 }' | rev | cut -c 2- | rev ;
grep 'Failed to write file' "$file" | awk '{ print $14 }' ;
grep 'skipped due to disabled module' "$file" | awk '{ print $9 }' ;
} >>"$output"

View File

@@ -0,0 +1,21 @@
if (Test-Path -Path $args[0] -PathType Leaf) {
$file=$args[0]
}
else {
Write-Host "CANNOT FIND LOG FILE"
Exit 1
}
if ($args[1] -ne $null) {
$output=$args[1]
Write-Host "Outputting IDs to $output"
}
else {
$output="./successful.txt"
}
Select-String -Path $file -Pattern "Downloaded submission" | ForEach-Object { -split $_.Line | Select-Object -Last 3 | Select-Object -SkipLast 2 } >> $output
Select-String -Path $file -Pattern "Resource hash" | ForEach-Object { -split $_.Line | Select-Object -Last 3 | Select-Object -SkipLast 2 } >> $output
Select-String -Path $file -Pattern "Download filter" | ForEach-Object { -split $_.Line | Select-Object -Last 4 | Select-Object -SkipLast 3 } >> $output
Select-String -Path $file -Pattern "already exists, continuing" | ForEach-Object { -split $_.Line | Select-Object -Last 4 | Select-Object -SkipLast 3 } >> $output
Select-String -Path $file -Pattern "Hard link made" | ForEach-Object { -split $_.Line | Select-Object -Last 1 } >> $output

View File

@@ -11,7 +11,13 @@ if [ -n "$2" ]; then
output="$2"
echo "Outputting IDs to $output"
else
output="successful.txt"
output="./successful.txt"
fi
{
grep 'Downloaded submission' "$file" | awk '{ print $(NF-2) }' ;
grep 'Resource hash' "$file" | awk '{ print $(NF-2) }' ;
grep 'Download filter' "$file" | awk '{ print $(NF-3) }' ;
grep 'already exists, continuing' "$file" | awk '{ print $(NF-3) }' ;
grep 'Hard link made' "$file" | awk '{ print $(NF) }' ;
} >> "$output"

30
scripts/print_summary.ps1 Normal file
View File

@@ -0,0 +1,30 @@
if (Test-Path -Path $args[0] -PathType Leaf) {
$file=$args[0]
}
else {
Write-Host "CANNOT FIND LOG FILE"
Exit 1
}
if ($args[1] -ne $null) {
$output=$args[1]
Write-Host "Outputting IDs to $output"
}
else {
$output="./successful.txt"
}
Write-Host -NoNewline "Downloaded submissions: "
Write-Host (Select-String -Path $file -Pattern "Downloaded submission" -AllMatches).Matches.Count
Write-Host -NoNewline "Failed downloads: "
Write-Host (Select-String -Path $file -Pattern "failed to download submission" -AllMatches).Matches.Count
Write-Host -NoNewline "Files already downloaded: "
Write-Host (Select-String -Path $file -Pattern "already exists, continuing" -AllMatches).Matches.Count
Write-Host -NoNewline "Hard linked submissions: "
Write-Host (Select-String -Path $file -Pattern "Hard link made" -AllMatches).Matches.Count
Write-Host -NoNewline "Excluded submissions: "
Write-Host (Select-String -Path $file -Pattern "in exclusion list" -AllMatches).Matches.Count
Write-Host -NoNewline "Files with existing hash skipped: "
Write-Host (Select-String -Path $file -Pattern "downloaded elsewhere" -AllMatches).Matches.Count
Write-Host -NoNewline "Submissions from excluded subreddits: "
Write-Host (Select-String -Path $file -Pattern "in skip list" -AllMatches).Matches.Count

13
scripts/tests/README.md Normal file
View File

@@ -0,0 +1,13 @@
# Bash Scripts Testing
The `bats` framework is included and used to test the scripts in this repository, specifically those that parse the logging output. Since that parsing relies on delicate regexes and field indexes, it needs tests.
## Running Tests
Running the tests is easy and can be done with a single command. With this directory as the working directory, run the following command.
```bash
./bats/bin/bats *.bats
```
This will run all test files that have the `.bats` suffix.
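Note that `bats` and its helpers are tracked as git submodules; if those directories are empty, they most likely need to be initialised first. A typical invocation would be:
```bash
git submodule update --init --recursive
```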

1
scripts/tests/bats Submodule

Submodule scripts/tests/bats added at ce5ca2802f

View File

@@ -0,0 +1 @@
[2021-06-12 12:49:18,452 - bdfr.downloader - DEBUG] - Submission m2601g skipped due to disabled module Direct

View File

@@ -0,0 +1,3 @@
[2021-06-12 11:13:35,665 - bdfr.downloader - ERROR] - Could not download submission nxv3ew: No downloader module exists for url https://www.biorxiv.org/content/10.1101/2021.06.11.447961v1?rss=1
[2021-06-12 11:14:21,958 - bdfr.downloader - ERROR] - Could not download submission nxv3ek: No downloader module exists for url https://alkossegyedit.hu/termek/pluss-macko-poloval-20cm/?feed_id=34832&_unique_id=60c40a1190ccb&utm_source=Reddit&utm_medium=AEAdmin&utm_campaign=Poster
[2021-06-12 11:17:53,456 - bdfr.downloader - ERROR] - Could not download submission nxv3ea: No downloader module exists for url https://www.biorxiv.org/content/10.1101/2021.06.11.448067v1?rss=1

View File

@@ -0,0 +1,2 @@
[2021-06-12 11:18:25,794 - bdfr.downloader - ERROR] - Failed to download resource https://i.redd.it/61fniokpjq471.jpg in submission nxv3dt with downloader Direct: Unrecoverable error requesting resource: HTTP Code 404

View File

@@ -0,0 +1,2 @@
[2021-06-12 08:38:35,657 - bdfr.downloader - ERROR] - Site Gallery failed to download submission nxr7x9: No images found in Reddit gallery
[2021-06-12 08:47:22,005 - bdfr.downloader - ERROR] - Site Gallery failed to download submission nxpn0h: Server responded with 503 to https://www.reddit.com/gallery/nxpkvh

View File

@@ -0,0 +1 @@
[2021-06-09 22:01:04,530 - bdfr.downloader - ERROR] - Failed to write file in submission nnboza to C:\Users\Yoga 14\path\to\output\ThotNetwork\KatieCarmine_I POST A NEW VIDEO ALMOST EVERYDAY AND YOU NEVER HAVE TO PAY EXTRA FOR IT! I want to share my sex life with you! Only $6 per month and you get full access to over 400 videos of me getting fuck_nnboza.mp4: [Errno 2] No such file or directory: 'C:\\Users\\Yoga 14\\path\\to\\output\\ThotNetwork\\KatieCarmine_I POST A NEW VIDEO ALMOST EVERYDAY AND YOU NEVER HAVE TO PAY EXTRA FOR IT! I want to share my sex life with you! Only $6 per month and you get full access to over 400 videos of me getting fuck_nnboza.mp4'

View File

@@ -0,0 +1,3 @@
[2021-06-12 08:41:51,464 - bdfr.downloader - DEBUG] - File /media/smaug/private/reddit/tumblr/nxry0l.jpg from submission nxry0l already exists, continuing
[2021-06-12 08:41:51,469 - bdfr.downloader - DEBUG] - File /media/smaug/private/reddit/tumblr/nxrlgn.gif from submission nxrlgn already exists, continuing
[2021-06-12 08:41:51,472 - bdfr.downloader - DEBUG] - File /media/smaug/private/reddit/tumblr/nxrq9g.png from submission nxrq9g already exists, continuing

View File

@@ -0,0 +1,3 @@
[2021-06-10 20:36:48,722 - bdfr.downloader - DEBUG] - Download filter removed nwfirr with URL https://www.youtube.com/watch?v=NVSiX0Tsees
[2021-06-12 19:56:36,848 - bdfr.downloader - DEBUG] - Download filter removed nwfgcl with URL https://www.reddit.com/r/MaliciousCompliance/comments/nwfgcl/new_guy_decided_to_play_manager_alright/
[2021-06-12 19:56:28,587 - bdfr.downloader - DEBUG] - Download filter removed nxuxjy with URL https://www.reddit.com/r/MaliciousCompliance/comments/nxuxjy/you_want_an_omelette_with_nothing_inside_okay/

View File

@@ -0,0 +1,7 @@
[2021-06-12 11:58:53,864 - bdfr.downloader - INFO] - Downloaded submission nxui9y from tumblr
[2021-06-12 11:58:56,618 - bdfr.downloader - INFO] - Downloaded submission nxsr4r from tumblr
[2021-06-12 11:58:59,026 - bdfr.downloader - INFO] - Downloaded submission nxviir from tumblr
[2021-06-12 11:59:00,289 - bdfr.downloader - INFO] - Downloaded submission nxusva from tumblr
[2021-06-12 11:59:00,735 - bdfr.downloader - INFO] - Downloaded submission nxvko7 from tumblr
[2021-06-12 11:59:01,215 - bdfr.downloader - INFO] - Downloaded submission nxvd63 from tumblr
[2021-06-12 11:59:13,891 - bdfr.downloader - INFO] - Downloaded submission nn9cor from tumblr

View File

@@ -0,0 +1 @@
[2021-06-11 17:33:02,118 - bdfr.downloader - INFO] - Hard link made linking /media/smaug/private/reddit/tumblr/nwnp2n.jpg to /media/smaug/private/reddit/tumblr/nwskqb.jpg in submission nwnp2n

View File

@@ -0,0 +1 @@
[2021-06-11 17:33:02,118 - bdfr.downloader - INFO] - Resource hash aaaaaaaaaaaaaaaaaaaaaaa from submission n86jk8 downloaded elsewhere

View File

@@ -0,0 +1,43 @@
setup() {
load ./test_helper/bats-support/load
load ./test_helper/bats-assert/load
}
teardown() {
rm -f failed.txt
}
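# Each test runs the extractor against a canned logfile, then checks the number of extracted IDs and that every line looks like a submission ID.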
@test "fail run no logfile" {
run ../extract_failed_ids.sh
assert_failure
}
@test "fail no downloader module" {
run ../extract_failed_ids.sh ./example_logfiles/failed_no_downloader.txt
assert [ "$( wc -l 'failed.txt' | awk '{ print $1 }' )" -eq "3" ];
assert [ "$( grep -Ecv '\w{6,7}' 'failed.txt' )" -eq "0" ];
}
@test "fail resource error" {
run ../extract_failed_ids.sh ./example_logfiles/failed_resource_error.txt
assert [ "$( wc -l 'failed.txt' | awk '{ print $1 }' )" -eq "1" ];
assert [ "$( grep -Ecv '\w{6,7}' 'failed.txt' )" -eq "0" ];
}
@test "fail site downloader error" {
run ../extract_failed_ids.sh ./example_logfiles/failed_sitedownloader_error.txt
assert [ "$( wc -l 'failed.txt' | awk '{ print $1 }' )" -eq "2" ];
assert [ "$( grep -Ecv '\w{6,7}' 'failed.txt' )" -eq "0" ];
}
@test "fail failed file write" {
run ../extract_failed_ids.sh ./example_logfiles/failed_write_error.txt
assert [ "$( wc -l 'failed.txt' | awk '{ print $1 }' )" -eq "1" ];
assert [ "$( grep -Ecv '\w{6,7}' 'failed.txt' )" -eq "0" ];
}
@test "fail disabled module" {
run ../extract_failed_ids.sh ./example_logfiles/failed_disabled_module.txt
assert [ "$( wc -l 'failed.txt' | awk '{ print $1 }' )" -eq "1" ];
assert [ "$( grep -Ecv '\w{6,7}' 'failed.txt' )" -eq "0" ];
}

View File

@@ -0,0 +1,38 @@
setup() {
load ./test_helper/bats-support/load
load ./test_helper/bats-assert/load
}
teardown() {
rm -f successful.txt
}
@test "success downloaded submission" {
run ../extract_successful_ids.sh ./example_logfiles/succeed_downloaded_submission.txt
assert [ "$( wc -l 'successful.txt' | awk '{ print $1 }' )" -eq "7" ];
assert [ "$( grep -Ecv '\w{6,7}' 'successful.txt' )" -eq "0" ];
}
@test "success resource hash" {
run ../extract_successful_ids.sh ./example_logfiles/succeed_resource_hash.txt
assert [ "$( wc -l 'successful.txt' | awk '{ print $1 }' )" -eq "1" ];
assert [ "$( grep -Ecv '\w{6,7}' 'successful.txt' )" -eq "0" ];
}
@test "success download filter" {
run ../extract_successful_ids.sh ./example_logfiles/succeed_download_filter.txt
assert [ "$( wc -l 'successful.txt' | awk '{ print $1 }' )" -eq "3" ];
assert [ "$( grep -Ecv '\w{6,7}' 'successful.txt' )" -eq "0" ];
}
@test "success already exists" {
run ../extract_successful_ids.sh ./example_logfiles/succeed_already_exists.txt
assert [ "$( wc -l 'successful.txt' | awk '{ print $1 }' )" -eq "3" ];
assert [ "$( grep -Ecv '\w{6,7}' 'successful.txt' )" -eq "0" ];
}
@test "success hard link" {
run ../extract_successful_ids.sh ./example_logfiles/succeed_hard_link.txt
assert [ "$( wc -l 'successful.txt' | awk '{ print $1 }' )" -eq "1" ];
assert [ "$( grep -Ecv '\w{6,7}' 'successful.txt' )" -eq "0" ];
}

View File

@@ -4,7 +4,7 @@ description_file = README.md
description_content_type = text/markdown
home_page = https://github.com/aliparlakci/bulk-downloader-for-reddit
keywords = reddit, download, archive
version = 2.1.0
version = 2.5.2
author = Ali Parlakci
author_email = parlakciali@gmail.com
maintainer = Serene Arc

View File

@@ -15,6 +15,7 @@ from bdfr.archive_entry.comment_archive_entry import CommentArchiveEntry
'subreddit': 'Python',
'submission': 'mgi4op',
'submission_title': '76% Faster CPython',
'distinguished': None,
}),
))
def test_get_comment_details(test_comment_id: str, expected_dict: dict, reddit_instance: praw.Reddit):

View File

@@ -26,6 +26,13 @@ def test_get_comments(test_submission_id: str, min_comments: int, reddit_instanc
'author': 'sinjen-tos',
'id': 'm3reby',
'link_flair_text': 'image',
'pinned': False,
'spoiler': False,
'over_18': False,
'locked': False,
'distinguished': None,
'created_utc': 1615583837,
'permalink': '/r/australia/comments/m3reby/this_little_guy_fell_out_of_a_tree_and_in_front/'
}),
('m3kua3', {'author': 'DELETED'}),
))

View File

@@ -0,0 +1,2 @@
#!/usr/bin/env python3
# coding=utf-8

View File

@@ -0,0 +1,123 @@
#!/usr/bin/env python3
# coding=utf-8
import re
import shutil
from pathlib import Path
import pytest
from click.testing import CliRunner
from bdfr.__main__ import cli
does_test_config_exist = Path('../test_config.cfg').exists()
def copy_test_config(run_path: Path):
shutil.copy(Path('../test_config.cfg'), Path(run_path, '../test_config.cfg'))
def create_basic_args_for_archive_runner(test_args: list[str], run_path: Path):
copy_test_config(run_path)
out = [
'archive',
str(run_path),
'-v',
'--config', str(Path(run_path, '../test_config.cfg')),
'--log', str(Path(run_path, 'test_log.txt')),
] + test_args
return out
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['-l', 'gstd4hk'],
['-l', 'm2601g', '-f', 'yaml'],
['-l', 'n60t4c', '-f', 'xml'],
))
def test_cli_archive_single(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert re.search(r'Writing entry .*? to file in .*? format', result.output)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['--subreddit', 'Mindustry', '-L', 25],
['--subreddit', 'Mindustry', '-L', 25, '--format', 'xml'],
['--subreddit', 'Mindustry', '-L', 25, '--format', 'yaml'],
['--subreddit', 'Mindustry', '-L', 25, '--sort', 'new'],
['--subreddit', 'Mindustry', '-L', 25, '--time', 'day'],
['--subreddit', 'Mindustry', '-L', 25, '--time', 'day', '--sort', 'new'],
))
def test_cli_archive_subreddit(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert re.search(r'Writing entry .*? to file in .*? format', result.output)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['--user', 'me', '--authenticate', '--all-comments', '-L', '10'],
['--user', 'me', '--user', 'djnish', '--authenticate', '--all-comments', '-L', '10'],
))
def test_cli_archive_all_user_comments(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['--comment-context', '--link', 'gxqapql'],
))
def test_cli_archive_full_context(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert 'Converting comment' in result.output
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.slow
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['--subreddit', 'all', '-L', 100],
['--subreddit', 'all', '-L', 100, '--sort', 'new'],
))
def test_cli_archive_long(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert re.search(r'Writing entry .*? to file in .*? format', result.output)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['--ignore-user', 'ArjanEgges', '-l', 'm3hxzd'],
))
def test_cli_archive_ignore_user(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert 'being an ignored user' in result.output
assert 'Attempting to archive submission' not in result.output

View File

@@ -0,0 +1,43 @@
#!/usr/bin/env python3
# coding=utf-8
import shutil
from pathlib import Path
import pytest
from click.testing import CliRunner
from bdfr.__main__ import cli
does_test_config_exist = Path('../test_config.cfg').exists()
def copy_test_config(run_path: Path):
shutil.copy(Path('../test_config.cfg'), Path(run_path, '../test_config.cfg'))
def create_basic_args_for_cloner_runner(test_args: list[str], tmp_path: Path):
out = [
'clone',
str(tmp_path),
'-v',
'--config', 'test_config.cfg',
'--log', str(Path(tmp_path, 'test_log.txt')),
] + test_args
return out
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['-l', 'm2601g'],
['-s', 'TrollXChromosomes/', '-L', 1],
))
def test_cli_scrape_general(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_cloner_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert 'Downloaded submission' in result.output
assert 'Record for entry item' in result.output

View File

@@ -1,7 +1,7 @@
#!/usr/bin/env python3
# coding=utf-8
import re
import shutil
from pathlib import Path
import pytest
@@ -9,26 +9,20 @@ from click.testing import CliRunner
from bdfr.__main__ import cli
does_test_config_exist = Path('../test_config.cfg').exists()
def copy_test_config(run_path: Path):
shutil.copy(Path('../test_config.cfg'), Path(run_path, '../test_config.cfg'))
def create_basic_args_for_download_runner(test_args: list[str], run_path: Path):
copy_test_config(run_path)
out = [
'download', str(run_path),
'-v',
'--config', str(Path(run_path, '../test_config.cfg')),
'--log', str(Path(run_path, 'test_log.txt')),
] + test_args
return out
@@ -61,6 +55,38 @@ def test_cli_download_subreddits(test_args: list[str], tmp_path: Path):
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert 'Added submissions from subreddit ' in result.output
assert 'Downloaded submission' in result.output
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.authenticated
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['-s', 'hentai', '-L', 10, '--search', 'red', '--authenticate'],
))
def test_cli_download_search_subreddits_authenticated(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_download_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert 'Added submissions from subreddit ' in result.output
assert 'Downloaded submission' in result.output
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.authenticated
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['--subreddit', 'friends', '-L', 10, '--authenticate'],
))
def test_cli_download_user_specific_subreddits(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_download_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert 'Added submissions from subreddit ' in result.output
@pytest.mark.online
@@ -117,6 +143,7 @@ def test_cli_download_multireddit_nonexistent(test_args: list[str], tmp_path: Pa
@pytest.mark.authenticated
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['--user', 'djnish', '--submitted', '--user', 'FriesWithThat', '-L', 10],
['--user', 'me', '--upvoted', '--authenticate', '-L', 10],
['--user', 'me', '--saved', '--authenticate', '-L', 10],
['--user', 'me', '--submitted', '--authenticate', '-L', 10],
@@ -151,7 +178,7 @@ def test_cli_download_user_data_bad_me_unauthenticated(test_args: list[str], tmp
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['--subreddit', 'python', '-L', 1, '--search-existing'],
))
def test_cli_download_search_existing(test_args: list[str], tmp_path: Path):
Path(tmp_path, 'test.txt').touch()
@@ -168,13 +195,14 @@ def test_cli_download_search_existing(test_args: list[str], tmp_path: Path):
@pytest.mark.parametrize('test_args', (
['--subreddit', 'tumblr', '-L', '25', '--skip', 'png', '--skip', 'jpg'],
['--subreddit', 'MaliciousCompliance', '-L', '25', '--skip', 'txt'],
['--subreddit', 'tumblr', '-L', '10', '--skip-domain', 'i.redd.it'],
))
def test_cli_download_download_filters(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_download_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert any((string in result.output for string in ('Download filter removed ', 'filtered due to URL')))
@pytest.mark.online
@@ -191,70 +219,6 @@ def test_cli_download_long(test_args: list[str], tmp_path: Path):
assert result.exit_code == 0
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['-l', 'gstd4hk'],
['-l', 'm2601g', '-f', 'yaml'],
['-l', 'n60t4c', '-f', 'xml'],
))
def test_cli_archive_single(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert re.search(r'Writing entry .*? to file in .*? format', result.output)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['--subreddit', 'Mindustry', '-L', 25],
['--subreddit', 'Mindustry', '-L', 25, '--format', 'xml'],
['--subreddit', 'Mindustry', '-L', 25, '--format', 'yaml'],
['--subreddit', 'Mindustry', '-L', 25, '--sort', 'new'],
['--subreddit', 'Mindustry', '-L', 25, '--time', 'day'],
['--subreddit', 'Mindustry', '-L', 25, '--time', 'day', '--sort', 'new'],
))
def test_cli_archive_subreddit(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert re.search(r'Writing entry .*? to file in .*? format', result.output)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['--user', 'me', '--authenticate', '--all-comments', '-L', '10'],
))
def test_cli_archive_all_user_comments(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.slow
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['--subreddit', 'all', '-L', 100],
['--subreddit', 'all', '-L', 100, '--sort', 'new'],
))
def test_cli_archive_long(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_archive_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert re.search(r'Writing entry .*? to file in .*? format', result.output)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.slow
@@ -265,12 +229,15 @@ def test_cli_archive_long(test_args: list[str], tmp_path: Path):
['--user', 'sdclhgsolgjeroij', '--upvoted', '-L', 10],
['--subreddit', 'submitters', '-L', 10], # Private subreddit
['--subreddit', 'donaldtrump', '-L', 10], # Banned subreddit
['--user', 'djnish', '--user', 'helen_darten', '-m', 'cuteanimalpics', '-L', 10],
['--subreddit', 'friends', '-L', 10],
))
def test_cli_download_soft_fail(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_download_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert 'Downloaded' not in result.output
@pytest.mark.online
@@ -333,9 +300,55 @@ def test_cli_download_subreddit_exclusion(test_args: list[str], tmp_path: Path):
['--file-scheme', '{TITLE}'],
['--file-scheme', '{TITLE}_test_{SUBREDDIT}'],
))
def test_cli_download_file_scheme_warning(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_download_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert 'Some files might not be downloaded due to name conflicts' in result.output
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['-l', 'm2601g', '--disable-module', 'Direct'],
['-l', 'nnb9vs', '--disable-module', 'YoutubeDlFallback'],
['-l', 'nnb9vs', '--disable-module', 'youtubedlfallback'],
))
def test_cli_download_disable_modules(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_download_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert 'skipped due to disabled module' in result.output
assert 'Downloaded submission' not in result.output
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
def test_cli_download_include_id_file(tmp_path: Path):
test_file = Path(tmp_path, 'include.txt')
test_args = ['--include-id-file', str(test_file)]
test_file.write_text('odr9wg\nody576')
runner = CliRunner()
test_args = create_basic_args_for_download_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert 'Downloaded submission' in result.output
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.skipif(not does_test_config_exist, reason='A test config file is required for integration tests')
@pytest.mark.parametrize('test_args', (
['--ignore-user', 'ArjanEgges', '-l', 'm3hxzd'],
))
def test_cli_download_ignore_user(test_args: list[str], tmp_path: Path):
runner = CliRunner()
test_args = create_basic_args_for_download_runner(test_args, tmp_path)
result = runner.invoke(cli, test_args)
assert result.exit_code == 0
assert 'Downloaded submission' not in result.output
assert 'being an ignored user' in result.output

View File

@@ -4,8 +4,9 @@ from unittest.mock import MagicMock
import pytest
from bdfr.exceptions import NotADownloadableLinkError
from bdfr.resource import Resource
from bdfr.site_downloaders.fallback_downloaders.ytdlp_fallback import YtdlpFallback
@pytest.mark.online
@@ -13,16 +14,26 @@ from bdfr.site_downloaders.fallback_downloaders.youtubedl_fallback import Youtub
('https://www.reddit.com/r/specializedtools/comments/n2nw5m/bamboo_splitter/', True),
('https://www.youtube.com/watch?v=P19nvJOmqCc', True),
('https://www.example.com/test', False),
('https://milesmatrix.bandcamp.com/album/la-boum/', False),
))
def test_can_handle_link(test_url: str, expected: bool):
result = YtdlpFallback.can_handle_link(test_url)
assert result == expected
@pytest.mark.online
@pytest.mark.parametrize('test_url', (
'https://milesmatrix.bandcamp.com/album/la-boum/',
))
def test_info_extraction_bad(test_url: str):
with pytest.raises(NotADownloadableLinkError):
YtdlpFallback.get_video_attributes(test_url)
@pytest.mark.online
@pytest.mark.slow
@pytest.mark.parametrize(('test_url', 'expected_hash'), (
('https://streamable.com/dt46y', 'b7e465adaade5f2b6d8c2b4b7d0a2878'),
('https://streamable.com/t8sem', '49b2d1220c485455548f1edbc05d4ecf'),
('https://www.reddit.com/r/specializedtools/comments/n2nw5m/bamboo_splitter/', '21968d3d92161ea5e0abdcaf6311b06c'),
('https://v.redd.it/9z1dnk3xr5k61', '351a2b57e888df5ccbc508056511f38d'),
@@ -30,8 +41,10 @@ def test_can_handle_link(test_url: str, expected: bool):
def test_find_resources(test_url: str, expected_hash: str):
test_submission = MagicMock()
test_submission.url = test_url
downloader = YtdlpFallback(test_submission)
resources = downloader.find_resources()
assert len(resources) == 1
assert isinstance(resources[0], Resource)
for res in resources:
res.download()
assert resources[0].hash.hexdigest() == expected_hash

View File

@@ -21,5 +21,5 @@ def test_download_resource(test_url: str, expected_hash: str):
resources = test_site.find_resources()
assert len(resources) == 1
assert isinstance(resources[0], Resource)
resources[0].download()
assert resources[0].hash.hexdigest() == expected_hash

View File

@@ -9,10 +9,11 @@ from bdfr.site_downloaders.base_downloader import BaseDownloader
from bdfr.site_downloaders.direct import Direct
from bdfr.site_downloaders.download_factory import DownloadFactory
from bdfr.site_downloaders.erome import Erome
from bdfr.site_downloaders.fallback_downloaders.ytdlp_fallback import YtdlpFallback
from bdfr.site_downloaders.gallery import Gallery
from bdfr.site_downloaders.gfycat import Gfycat
from bdfr.site_downloaders.imgur import Imgur
from bdfr.site_downloaders.pornhub import PornHub
from bdfr.site_downloaders.redgifs import Redgifs
from bdfr.site_downloaders.self_post import SelfPost
from bdfr.site_downloaders.youtube import Youtube
@@ -29,6 +30,7 @@ from bdfr.site_downloaders.youtube import Youtube
('https://imgur.com/BuzvZwb.gifv', Imgur),
('https://i.imgur.com/6fNdLst.gif', Direct),
('https://imgur.com/a/MkxAzeg', Imgur),
('https://i.imgur.com/OGeVuAe.giff', Imgur),
('https://www.reddit.com/gallery/lu93m7', Gallery),
('https://gfycat.com/concretecheerfulfinwhale', Gfycat),
('https://www.erome.com/a/NWGw0F09', Erome),
@@ -40,10 +42,12 @@ from bdfr.site_downloaders.youtube import Youtube
('https://i.imgur.com/3SKrQfK.jpg?1', Direct),
('https://dynasty-scans.com/system/images_images/000/017/819/original/80215103_p0.png?1612232781', Direct),
('https://m.imgur.com/a/py3RW0j', Imgur),
('https://v.redd.it/9z1dnk3xr5k61', YoutubeDlFallback),
('https://streamable.com/dt46y', YoutubeDlFallback),
('https://vimeo.com/channels/31259/53576664', YoutubeDlFallback),
('http://video.pbs.org/viralplayer/2365173446/', YoutubeDlFallback),
('https://v.redd.it/9z1dnk3xr5k61', YtdlpFallback),
('https://streamable.com/dt46y', YtdlpFallback),
('https://vimeo.com/channels/31259/53576664', YtdlpFallback),
('http://video.pbs.org/viralplayer/2365173446/', YtdlpFallback),
('https://www.pornhub.com/view_video.php?viewkey=ph5a2ee0461a8d0', PornHub),
('https://www.patreon.com/posts/minecart-track-59346560', Gallery),
))
def test_factory_lever_good(test_submission_url: str, expected_class: BaseDownloader, reddit_instance: praw.Reddit):
result = DownloadFactory.pull_lever(test_submission_url)
@@ -69,6 +73,19 @@ def test_factory_lever_bad(test_url: str):
('https://youtube.com/watch?v=Gv8Wz74FjVA', 'youtube.com/watch'),
('https://i.imgur.com/BuzvZwb.gifv', 'i.imgur.com/BuzvZwb.gifv'),
))
def test_sanitise_urll(test_url: str, expected: str):
result = DownloadFactory._sanitise_url(test_url)
def test_sanitise_url(test_url: str, expected: str):
result = DownloadFactory.sanitise_url(test_url)
assert result == expected
@pytest.mark.parametrize(('test_url', 'expected'), (
('www.example.com/test.asp', True),
('www.example.com/test.html', True),
('www.example.com/test.js', True),
('www.example.com/test.xhtml', True),
('www.example.com/test.mp4', False),
('www.example.com/test.png', False),
))
def test_is_web_resource(test_url: str, expected: bool):
result = DownloadFactory.is_web_resource(test_url)
assert result == expected
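The new test_is_web_resource cases pin down one classification rule: page-like extensions mark a URL as a web resource to be scraped rather than downloaded directly. A toy version consistent with those cases (the extension list is an assumption, inferred only from the parametrised inputs):

```python
import re

def is_web_resource(url: str) -> bool:
    # Extensions that indicate an HTML page rather than a media file;
    # the exact list is illustrative, taken from the test cases above.
    web_extensions = ('asp', 'html', 'js', 'xhtml')
    return bool(re.search(r'\.(?:{})(?:$|\?)'.format('|'.join(web_extensions)), url))

assert is_web_resource('www.example.com/test.html') is True
assert is_web_resource('www.example.com/test.mp4') is False
```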

View File

@@ -1,6 +1,6 @@
#!/usr/bin/env python3
# coding=utf-8
import re
from unittest.mock import MagicMock
import pytest
@@ -11,47 +11,37 @@ from bdfr.site_downloaders.erome import Erome
@pytest.mark.online
@pytest.mark.parametrize(('test_url', 'expected_urls'), (
('https://www.erome.com/a/vqtPuLXh', (
'https://s11.erome.com/365/vqtPuLXh/KH2qBT99_480p.mp4',
r'https://s\d+.erome.com/365/vqtPuLXh/KH2qBT99_480p.mp4',
)),
('https://www.erome.com/a/ORhX0FZz', (
'https://s4.erome.com/355/ORhX0FZz/9IYQocM9_480p.mp4',
'https://s4.erome.com/355/ORhX0FZz/9eEDc8xm_480p.mp4',
'https://s4.erome.com/355/ORhX0FZz/EvApC7Rp_480p.mp4',
'https://s4.erome.com/355/ORhX0FZz/LruobtMs_480p.mp4',
'https://s4.erome.com/355/ORhX0FZz/TJNmSUU5_480p.mp4',
'https://s4.erome.com/355/ORhX0FZz/X11Skh6Z_480p.mp4',
'https://s4.erome.com/355/ORhX0FZz/bjlTkpn7_480p.mp4'
r'https://s\d+.erome.com/355/ORhX0FZz/9IYQocM9_480p.mp4',
r'https://s\d+.erome.com/355/ORhX0FZz/9eEDc8xm_480p.mp4',
r'https://s\d+.erome.com/355/ORhX0FZz/EvApC7Rp_480p.mp4',
r'https://s\d+.erome.com/355/ORhX0FZz/LruobtMs_480p.mp4',
r'https://s\d+.erome.com/355/ORhX0FZz/TJNmSUU5_480p.mp4',
r'https://s\d+.erome.com/355/ORhX0FZz/X11Skh6Z_480p.mp4',
r'https://s\d+.erome.com/355/ORhX0FZz/bjlTkpn7_480p.mp4'
)),
))
def test_get_link(test_url: str, expected_urls: tuple[str]):
result = Erome._get_links(test_url)
assert set(result) == set(expected_urls)
assert all([any([re.match(p, r) for r in result]) for p in expected_urls])
@pytest.mark.online
@pytest.mark.slow
@pytest.mark.parametrize(('test_url', 'expected_hashes'), (
('https://www.erome.com/a/vqtPuLXh', {
'5da2a8d60d87bed279431fdec8e7d72f'
}),
('https://www.erome.com/i/ItASD33e', {
'b0d73fedc9ce6995c2f2c4fdb6f11eff'
}),
('https://www.erome.com/a/lGrcFxmb', {
'0e98f9f527a911dcedde4f846bb5b69f',
'25696ae364750a5303fc7d7dc78b35c1',
'63775689f438bd393cde7db6d46187de',
'a1abf398cfd4ef9cfaf093ceb10c746a',
'bd9e1a4ea5ef0d6ba47fb90e337c2d14'
}),
@pytest.mark.parametrize(('test_url', 'expected_hashes_len'), (
('https://www.erome.com/a/vqtPuLXh', 1),
('https://www.erome.com/a/4tP3KI6F', 1),
))
def test_download_resource(test_url: str, expected_hashes: tuple[str]):
def test_download_resource(test_url: str, expected_hashes_len: int):
# Can't compare hashes for this test: Erome doesn't return exactly the same file from request to request, so the
# hash changes between downloads
mock_submission = MagicMock()
mock_submission.url = test_url
test_site = Erome(mock_submission)
resources = test_site.find_resources()
[res.download(120) for res in resources]
for res in resources:
res.download()
resource_hashes = [res.hash.hexdigest() for res in resources]
assert len(resource_hashes) == len(expected_hashes)
assert len(resource_hashes) == expected_hashes_len
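The switch from literal URLs to r'https://s\d+...' patterns, and from fixed hash sets to expected counts, reflects how Erome serves content: the mirror number in the hostname and the exact bytes returned vary between requests. A small illustration of the pattern-based assertion style the rewritten test uses, with sample data rather than live results:

```python
import re

# Mirror numbers rotate (s4, s11, ...), so match a pattern per expected file
# instead of comparing literal URLs.
results = ['https://s11.erome.com/365/vqtPuLXh/KH2qBT99_480p.mp4']
patterns = [r'https://s\d+.erome.com/365/vqtPuLXh/KH2qBT99_480p.mp4']
assert all(any(re.match(p, r) for r in results) for p in patterns)
```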

View File

@@ -4,34 +4,37 @@
import praw
import pytest
from bdfr.exceptions import SiteDownloaderError
from bdfr.site_downloaders.gallery import Gallery
@pytest.mark.online
@pytest.mark.parametrize(('test_url', 'expected'), (
('https://www.reddit.com/gallery/m6lvrh', {
'https://preview.redd.it/18nzv9ch0hn61.jpg?width=4160&'
'format=pjpg&auto=webp&s=470a825b9c364e0eace0036882dcff926f821de8',
'https://preview.redd.it/jqkizcch0hn61.jpg?width=4160&'
'format=pjpg&auto=webp&s=ae4f552a18066bb6727676b14f2451c5feecf805',
'https://preview.redd.it/k0fnqzbh0hn61.jpg?width=4160&'
'format=pjpg&auto=webp&s=c6a10fececdc33983487c16ad02219fd3fc6cd76',
'https://preview.redd.it/m3gamzbh0hn61.jpg?width=4160&'
'format=pjpg&auto=webp&s=0dd90f324711851953e24873290b7f29ec73c444'
@pytest.mark.parametrize(('test_ids', 'expected'), (
([
{'media_id': '18nzv9ch0hn61'},
{'media_id': 'jqkizcch0hn61'},
{'media_id': 'k0fnqzbh0hn61'},
{'media_id': 'm3gamzbh0hn61'},
], {
'https://i.redd.it/18nzv9ch0hn61.jpg',
'https://i.redd.it/jqkizcch0hn61.jpg',
'https://i.redd.it/k0fnqzbh0hn61.jpg',
'https://i.redd.it/m3gamzbh0hn61.jpg'
}),
('https://www.reddit.com/gallery/ljyy27', {
'https://preview.redd.it/04vxj25uqih61.png?width=92&'
'format=png&auto=webp&s=6513f3a5c5128ee7680d402cab5ea4fb2bbeead4',
'https://preview.redd.it/0fnx83kpqih61.png?width=241&'
'format=png&auto=webp&s=655e9deb6f499c9ba1476eaff56787a697e6255a',
'https://preview.redd.it/7zkmr1wqqih61.png?width=237&'
'format=png&auto=webp&s=19de214e634cbcad9959f19570c616e29be0c0b0',
'https://preview.redd.it/u37k5gxrqih61.png?width=443&'
'format=png&auto=webp&s=e74dae31841fe4a2545ffd794d3b25b9ff0eb862'
([
{'media_id': '04vxj25uqih61'},
{'media_id': '0fnx83kpqih61'},
{'media_id': '7zkmr1wqqih61'},
{'media_id': 'u37k5gxrqih61'},
], {
'https://i.redd.it/04vxj25uqih61.png',
'https://i.redd.it/0fnx83kpqih61.png',
'https://i.redd.it/7zkmr1wqqih61.png',
'https://i.redd.it/u37k5gxrqih61.png'
}),
))
def test_gallery_get_links(test_url: str, expected: set[str]):
results = Gallery._get_links(test_url)
def test_gallery_get_links(test_ids: list[dict], expected: set[str]):
results = Gallery._get_links(test_ids)
assert set(results) == expected
@@ -39,22 +42,45 @@ def test_gallery_get_links(test_url: str, expected: set[str]):
@pytest.mark.reddit
@pytest.mark.parametrize(('test_submission_id', 'expected_hashes'), (
('m6lvrh', {
'6c8a892ae8066cbe119218bcaac731e1',
'93ce177f8cb7994906795f4615114d13',
'9a293adf19354f14582608cf22124574',
'b73e2c3daee02f99404644ea02f1ae65'
'5c42b8341dd56eebef792e86f3981c6a',
'8f38d76da46f4057bf2773a778e725ca',
'f5776f8f90491c8b770b8e0a6bfa49b3',
'fa1a43c94da30026ad19a9813a0ed2c2',
}),
('ljyy27', {
'1bc38bed88f9c4770e22a37122d5c941',
'2539a92b78f3968a069df2dffe2279f9',
'37dea50281c219b905e46edeefc1a18d',
'ec4924cf40549728dcf53dd40bc7a73c'
'359c203ec81d0bc00e675f1023673238',
'79262fd46bce5bfa550d878a3b898be4',
'808c35267f44acb523ce03bfa5687404',
'ec8b65bdb7f1279c4b3af0ea2bbb30c3',
}),
('obkflw', {
'65163f685fb28c5b776e0e77122718be',
'2a337eb5b13c34d3ca3f51b5db7c13e9',
}),
('rb3ub6', { # patreon post
'748a976c6cedf7ea85b6f90e7cb685c7',
'839796d7745e88ced6355504e1f74508',
'bcdb740367d0f19f97a77e614b48a42d',
'0f230b8c4e5d103d35a773fab9814ec3',
'e5192d6cb4f84c4f4a658355310bf0f9',
'91cbe172cd8ccbcf049fcea4204eb979',
})
))
def test_gallery_download(test_submission_id: str, expected_hashes: set[str], reddit_instance: praw.Reddit):
test_submission = reddit_instance.submission(id=test_submission_id)
gallery = Gallery(test_submission)
results = gallery.find_resources()
[res.download(120) for res in results]
[res.download() for res in results]
hashes = [res.hash.hexdigest() for res in results]
assert set(hashes) == expected_hashes
@pytest.mark.parametrize('test_id', (
'n0pyzp',
'nxyahw',
))
def test_gallery_download_raises_right_error(test_id: str, reddit_instance: praw.Reddit):
test_submission = reddit_instance.submission(id=test_id)
gallery = Gallery(test_submission)
with pytest.raises(SiteDownloaderError):
gallery.find_resources()
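Gallery._get_links now receives the gallery's media_id metadata instead of scraping preview.redd.it URLs from the page, and emits direct i.redd.it links. A toy equivalent, assuming the file extension is already known (the real downloader would derive it from the submission's media metadata):

```python
def gallery_links(items: list[dict], extension: str = '.jpg') -> list[str]:
    # Build direct i.redd.it links from media_id entries, as the rewritten
    # test expects; the fixed extension is a simplifying assumption.
    return [f"https://i.redd.it/{item['media_id']}{extension}" for item in items]

assert gallery_links([{'media_id': '18nzv9ch0hn61'}]) == ['https://i.redd.it/18nzv9ch0hn61.jpg']
```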

View File

@@ -13,7 +13,6 @@ from bdfr.site_downloaders.gfycat import Gfycat
@pytest.mark.parametrize(('test_url', 'expected_url'), (
('https://gfycat.com/definitivecaninecrayfish', 'https://giant.gfycat.com/DefinitiveCanineCrayfish.mp4'),
('https://gfycat.com/dazzlingsilkyiguana', 'https://giant.gfycat.com/DazzlingSilkyIguana.mp4'),
('https://gfycat.com/webbedimpurebutterfly', 'https://thumbs2.redgifs.com/WebbedImpureButterfly.mp4'),
))
def test_get_link(test_url: str, expected_url: str):
result = Gfycat._get_link(test_url)
@@ -32,5 +31,5 @@ def test_download_resource(test_url: str, expected_hash: str):
resources = test_site.find_resources()
assert len(resources) == 1
assert isinstance(resources[0], Resource)
resources[0].download(120)
resources[0].download()
assert resources[0].hash.hexdigest() == expected_hash

View File

@@ -65,11 +65,11 @@ def test_get_data_album(test_url: str, expected_gen_dict: dict, expected_image_d
{'hash': 'dLk3FGY', 'title': '', 'ext': '.mp4', 'animated': True}
),
(
'https://imgur.com/BuzvZwb.gifv',
'https://imgur.com/65FqTpT.gifv',
{
'hash': 'BuzvZwb',
'hash': '65FqTpT',
'title': '',
'description': 'Akron Glass Works',
'description': '',
'animated': True,
'mimetype': 'video/mp4'
},
@@ -111,7 +111,7 @@ def test_imgur_extension_validation_bad(test_extension: str):
),
(
'https://imgur.com/gallery/IjJJdlC',
('7227d4312a9779b74302724a0cfa9081',),
('740b006cf9ec9d6f734b6e8f5130bdab',),
),
(
'https://imgur.com/a/dcc84Gt',
@@ -122,6 +122,42 @@ def test_imgur_extension_validation_bad(test_extension: str):
'029c475ce01b58fdf1269d8771d33913',
),
),
(
'https://imgur.com/a/eemHCCK',
(
'9cb757fd8f055e7ef7aa88addc9d9fa5',
'b6cb6c918e2544e96fb7c07d828774b5',
'fb6c913d721c0bbb96aa65d7f560d385',
),
),
(
'https://i.imgur.com/lFJai6i.gifv',
('01a6e79a30bec0e644e5da12365d5071',),
),
(
'https://i.imgur.com/ywSyILa.gifv?',
('56d4afc32d2966017c38d98568709b45',),
),
(
'https://imgur.com/ubYwpbk.GIFV',
('d4a774aac1667783f9ed3a1bd02fac0c',),
),
(
'https://i.imgur.com/j1CNCZY.gifv',
('58e7e6d972058c18b7ecde910ca147e3',),
),
(
'https://i.imgur.com/uTvtQsw.gifv',
('46c86533aa60fc0e09f2a758513e3ac2',),
),
(
'https://i.imgur.com/OGeVuAe.giff',
('77389679084d381336f168538793f218',)
),
(
'https://i.imgur.com/OGeVuAe.gift',
('77389679084d381336f168538793f218',)
),
))
def test_find_resources(test_url: str, expected_hashes: list[str]):
mock_download = Mock()
@@ -129,7 +165,6 @@ def test_find_resources(test_url: str, expected_hashes: list[str]):
downloader = Imgur(mock_download)
results = downloader.find_resources()
assert all([isinstance(res, Resource) for res in results])
[res.download(120) for res in results]
[res.download() for res in results]
hashes = set([res.hash.hexdigest() for res in results])
assert len(results) == len(expected_hashes)
assert hashes == set(expected_hashes)
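The added Imgur cases stress odd inputs: uppercase .GIFV, mis-typed .giff/.gift, and a trailing '?'. A hypothetical parser showing the property those cases demand, namely that the image hash survives whatever extension follows it (this is not the BDFR Imgur code):

```python
import re

def imgur_hash(url: str) -> str:
    # Grab the hash segment regardless of extension case or typos;
    # illustrative regex only.
    match = re.search(r'imgur\.com/(?:a/|gallery/)?(\w+)', url)
    if match is None:
        raise ValueError(f'Cannot parse Imgur URL: {url}')
    return match.group(1)

assert imgur_hash('https://i.imgur.com/OGeVuAe.giff') == 'OGeVuAe'
assert imgur_hash('https://imgur.com/ubYwpbk.GIFV') == 'ubYwpbk'
assert imgur_hash('https://i.imgur.com/ywSyILa.gifv?') == 'ywSyILa'
```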

View File

@@ -0,0 +1,39 @@
#!/usr/bin/env python3
# coding=utf-8
from unittest.mock import MagicMock
import pytest
from bdfr.exceptions import SiteDownloaderError
from bdfr.resource import Resource
from bdfr.site_downloaders.pornhub import PornHub
@pytest.mark.online
@pytest.mark.slow
@pytest.mark.parametrize(('test_url', 'expected_hash'), (
('https://www.pornhub.com/view_video.php?viewkey=ph6074c59798497', 'd9b99e4ebecf2d8d67efe5e70d2acf8a'),
))
def test_find_resources_good(test_url: str, expected_hash: str):
test_submission = MagicMock()
test_submission.url = test_url
downloader = PornHub(test_submission)
resources = downloader.find_resources()
assert len(resources) == 1
assert isinstance(resources[0], Resource)
resources[0].download()
assert resources[0].hash.hexdigest() == expected_hash
@pytest.mark.online
@pytest.mark.parametrize('test_url', (
'https://www.pornhub.com/view_video.php?viewkey=ph5ede121f0d3f8',
))
def test_find_resources_bad(test_url: str):
test_submission = MagicMock()
test_submission.url = test_url
downloader = PornHub(test_submission)
with pytest.raises(SiteDownloaderError):
downloader.find_resources()
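The second test pins the failure mode for a removed video: find_resources should raise SiteDownloaderError rather than return an empty list. A hedged sketch of that guard (the video_info shape and function name are assumptions):

```python
from bdfr.exceptions import SiteDownloaderError

def ensure_downloadable(video_info: dict) -> dict:
    # Surface a site-downloader error when extraction yields nothing usable,
    # instead of silently producing zero resources.
    if not video_info:
        raise SiteDownloaderError('Video info extraction returned nothing')
    return video_info
```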

View File

@@ -15,10 +15,8 @@ from bdfr.site_downloaders.redgifs import Redgifs
'https://thumbs2.redgifs.com/FrighteningVictoriousSalamander.mp4'),
('https://redgifs.com/watch/springgreendecisivetaruca',
'https://thumbs2.redgifs.com/SpringgreenDecisiveTaruca.mp4'),
('https://www.gifdeliverynetwork.com/regalshoddyhorsechestnutleafminer',
'https://thumbs2.redgifs.com/RegalShoddyHorsechestnutleafminer.mp4'),
('https://www.gifdeliverynetwork.com/maturenexthippopotamus',
'https://thumbs2.redgifs.com/MatureNextHippopotamus.mp4'),
('https://www.redgifs.com/watch/palegoldenrodrawhalibut',
'https://thumbs2.redgifs.com/PalegoldenrodRawHalibut.mp4'),
))
def test_get_link(test_url: str, expected: str):
result = Redgifs._get_link(test_url)
@@ -29,8 +27,8 @@ def test_get_link(test_url: str, expected: str):
@pytest.mark.parametrize(('test_url', 'expected_hash'), (
('https://redgifs.com/watch/frighteningvictorioussalamander', '4007c35d9e1f4b67091b5f12cffda00a'),
('https://redgifs.com/watch/springgreendecisivetaruca', '8dac487ac49a1f18cc1b4dabe23f0869'),
('https://www.gifdeliverynetwork.com/maturenexthippopotamus', '9bec0a9e4163a43781368ed5d70471df'),
('https://www.gifdeliverynetwork.com/regalshoddyhorsechestnutleafminer', '8afb4e2c090a87140230f2352bf8beba'),
('https://redgifs.com/watch/leafysaltydungbeetle', '076792c660b9c024c0471ef4759af8bd'),
('https://www.redgifs.com/watch/palegoldenrodrawhalibut', '46d5aa77fe80c6407de1ecc92801c10e'),
))
def test_download_resource(test_url: str, expected_hash: str):
mock_submission = Mock()
@@ -39,5 +37,5 @@ def test_download_resource(test_url: str, expected_hash: str):
resources = test_site.find_resources()
assert len(resources) == 1
assert isinstance(resources[0], Resource)
resources[0].download(120)
resources[0].download()
assert resources[0].hash.hexdigest() == expected_hash

View File

@@ -0,0 +1,73 @@
#!/usr/bin/env python3
# coding=utf-8
from unittest.mock import Mock
import pytest
from bdfr.resource import Resource
from bdfr.site_downloaders.vidble import Vidble
@pytest.mark.parametrize(('test_url', 'expected'), (
('/RDFbznUvcN_med.jpg', '/RDFbznUvcN.jpg'),
))
def test_change_med_url(test_url: str, expected: str):
result = Vidble.change_med_url(test_url)
assert result == expected
@pytest.mark.online
@pytest.mark.parametrize(('test_url', 'expected'), (
('https://www.vidble.com/show/UxsvAssYe5', {
'https://www.vidble.com/UxsvAssYe5.gif',
}),
('https://vidble.com/show/RDFbznUvcN', {
'https://www.vidble.com/RDFbznUvcN.jpg',
}),
('https://vidble.com/album/h0jTLs6B', {
'https://www.vidble.com/XG4eAoJ5JZ.jpg',
'https://www.vidble.com/IqF5UdH6Uq.jpg',
'https://www.vidble.com/VWuNsnLJMD.jpg',
'https://www.vidble.com/sMmM8O650W.jpg',
}),
('https://vidble.com/watch?v=0q4nWakqM6kzQWxlePD8N62Dsflev0N9', {
'https://www.vidble.com/0q4nWakqM6kzQWxlePD8N62Dsflev0N9.mp4',
}),
('https://www.vidble.com/pHuwWkOcEb', {
'https://www.vidble.com/pHuwWkOcEb.jpg',
}),
))
def test_get_links(test_url: str, expected: set[str]):
results = Vidble.get_links(test_url)
assert results == expected
@pytest.mark.parametrize(('test_url', 'expected_hashes'), (
('https://www.vidble.com/show/UxsvAssYe5', {
'0ef2f8e0e0b45936d2fb3e6fbdf67e28',
}),
('https://vidble.com/show/RDFbznUvcN', {
'c2dd30a71e32369c50eed86f86efff58',
}),
('https://vidble.com/album/h0jTLs6B', {
'3b3cba02e01c91f9858a95240b942c71',
'dd6ecf5fc9e936f9fb614eb6a0537f99',
'b31a942cd8cdda218ed547bbc04c3a27',
'6f77c570b451eef4222804bd52267481',
}),
('https://vidble.com/watch?v=0q4nWakqM6kzQWxlePD8N62Dsflev0N9', {
'cebe9d5f24dba3b0443e5097f160ca83',
}),
('https://www.vidble.com/pHuwWkOcEb', {
'585f486dd0b2f23a57bddbd5bf185bc7',
}),
))
def test_find_resources(test_url: str, expected_hashes: set[str]):
mock_download = Mock()
mock_download.url = test_url
downloader = Vidble(mock_download)
results = downloader.find_resources()
assert all([isinstance(res, Resource) for res in results])
[res.download() for res in results]
hashes = set([res.hash.hexdigest() for res in results])
assert hashes == set(expected_hashes)
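test_change_med_url fixes the thumbnail-to-original mapping: Vidble appends _med to reduced-size images, so stripping the suffix yields the full-resolution file. A one-liner consistent with the test case (illustrative, not the module's code):

```python
import re

def change_med_url(url: str) -> str:
    # Drop the '_med' thumbnail marker while keeping the extension.
    return re.sub(r'_med(\.\w+)$', r'\1', url)

assert change_med_url('/RDFbznUvcN_med.jpg') == '/RDFbznUvcN.jpg'
```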

View File

@@ -5,6 +5,7 @@ from unittest.mock import MagicMock
import pytest
from bdfr.exceptions import NotADownloadableLinkError
from bdfr.resource import Resource
from bdfr.site_downloaders.youtube import Youtube
@@ -12,15 +13,29 @@ from bdfr.site_downloaders.youtube import Youtube
@pytest.mark.online
@pytest.mark.slow
@pytest.mark.parametrize(('test_url', 'expected_hash'), (
('https://www.youtube.com/watch?v=uSm2VDgRIUs', 'f70b704b4b78b9bb5cd032bfc26e4971'),
('https://www.youtube.com/watch?v=m-tKnjFwleU', '30314930d853afff8ebc7d8c36a5b833'),
('https://www.youtube.com/watch?v=uSm2VDgRIUs', '2d60b54582df5b95ec72bb00b580d2ff'),
('https://www.youtube.com/watch?v=GcI7nxQj7HA', '5db0fc92a0a7fb9ac91e63505eea9cf0'),
('https://youtu.be/TMqPOlp4tNo', 'f68c00b018162857f3df4844c45302e7'), # Age restricted
))
def test_find_resources(test_url: str, expected_hash: str):
def test_find_resources_good(test_url: str, expected_hash: str):
test_submission = MagicMock()
test_submission.url = test_url
downloader = Youtube(test_submission)
resources = downloader.find_resources()
assert len(resources) == 1
assert isinstance(resources[0], Resource)
resources[0].download(120)
resources[0].download()
assert resources[0].hash.hexdigest() == expected_hash
@pytest.mark.online
@pytest.mark.parametrize('test_url', (
'https://www.polygon.com/disney-plus/2020/5/14/21249881/gargoyles-animated-series-disney-plus-greg-weisman'
'-interview-oj-simpson-goliath-chronicles',
))
def test_find_resources_bad(test_url: str):
test_submission = MagicMock()
test_submission.url = test_url
downloader = Youtube(test_submission)
with pytest.raises(NotADownloadableLinkError):
downloader.find_resources()

View File

@@ -7,51 +7,20 @@ from unittest.mock import MagicMock
import praw
import pytest
from bdfr.archive_entry.submission_archive_entry import SubmissionArchiveEntry
from bdfr.archiver import Archiver
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_submission_id', (
'm3reby',
@pytest.mark.parametrize(('test_submission_id', 'test_format'), (
('m3reby', 'xml'),
('m3reby', 'json'),
('m3reby', 'yaml'),
))
def test_write_submission_json(test_submission_id: str, tmp_path: Path, reddit_instance: praw.Reddit):
def test_write_submission_json(test_submission_id: str, tmp_path: Path, test_format: str, reddit_instance: praw.Reddit):
archiver_mock = MagicMock()
test_path = Path(tmp_path, 'test.json')
archiver_mock.args.format = test_format
test_path = Path(tmp_path, 'test')
test_submission = reddit_instance.submission(id=test_submission_id)
archiver_mock.file_name_formatter.format_path.return_value = test_path
test_entry = SubmissionArchiveEntry(test_submission)
Archiver._write_entry_json(archiver_mock, test_entry)
archiver_mock._write_content_to_disk.assert_called_once()
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_submission_id', (
'm3reby',
))
def test_write_submission_xml(test_submission_id: str, tmp_path: Path, reddit_instance: praw.Reddit):
archiver_mock = MagicMock()
test_path = Path(tmp_path, 'test.xml')
test_submission = reddit_instance.submission(id=test_submission_id)
archiver_mock.file_name_formatter.format_path.return_value = test_path
test_entry = SubmissionArchiveEntry(test_submission)
Archiver._write_entry_xml(archiver_mock, test_entry)
archiver_mock._write_content_to_disk.assert_called_once()
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_submission_id', (
'm3reby',
))
def test_write_submission_yaml(test_submission_id: str, tmp_path: Path, reddit_instance: praw.Reddit):
archiver_mock = MagicMock()
archiver_mock.download_directory = tmp_path
test_path = Path(tmp_path, 'test.yaml')
test_submission = reddit_instance.submission(id=test_submission_id)
archiver_mock.file_name_formatter.format_path.return_value = test_path
test_entry = SubmissionArchiveEntry(test_submission)
Archiver._write_entry_yaml(archiver_mock, test_entry)
archiver_mock._write_content_to_disk.assert_called_once()
Archiver.write_entry(archiver_mock, test_submission)
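The three near-identical per-format tests collapse into one parametrised test because the archiver now exposes a single write_entry entry point that dispatches on args.format. A minimal sketch of that dispatch shape, using stdlib json and PyYAML (XML omitted; the function and key names here are assumptions, not the Archiver internals):

```python
import json
import yaml  # PyYAML

def serialise_entry(entry: dict, fmt: str) -> str:
    # One entry point, format chosen by configuration, as the parametrised
    # test drives xml/json/yaml through Archiver.write_entry.
    dispatch = {
        'json': lambda e: json.dumps(e, indent=4),
        'yaml': lambda e: yaml.dump(e),
    }
    if fmt not in dispatch:
        raise ValueError(f'Unknown archive format: {fmt}')
    return dispatch[fmt](entry)
```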

452
tests/test_connector.py Normal file
View File

@@ -0,0 +1,452 @@
#!/usr/bin/env python3
# coding=utf-8
from datetime import datetime, timedelta
from pathlib import Path
from typing import Iterator
from unittest.mock import MagicMock
import praw
import praw.models
import pytest
from bdfr.configuration import Configuration
from bdfr.connector import RedditConnector, RedditTypes
from bdfr.download_filter import DownloadFilter
from bdfr.exceptions import BulkDownloaderException
from bdfr.file_name_formatter import FileNameFormatter
from bdfr.site_authenticator import SiteAuthenticator
@pytest.fixture()
def args() -> Configuration:
args = Configuration()
args.time_format = 'ISO'
return args
@pytest.fixture()
def downloader_mock(args: Configuration):
downloader_mock = MagicMock()
downloader_mock.args = args
downloader_mock.sanitise_subreddit_name = RedditConnector.sanitise_subreddit_name
downloader_mock.create_filtered_listing_generator = lambda x: RedditConnector.create_filtered_listing_generator(
downloader_mock, x)
downloader_mock.split_args_input = RedditConnector.split_args_input
downloader_mock.master_hash_list = {}
return downloader_mock
def assert_all_results_are_submissions(result_limit: int, results: list[Iterator]) -> list:
results = [sub for res in results for sub in res]
assert all([isinstance(res, praw.models.Submission) for res in results])
assert not any([isinstance(m, MagicMock) for m in results])
if result_limit is not None:
assert len(results) == result_limit
return results
def assert_all_results_are_submissions_or_comments(result_limit: int, results: list[Iterator]) -> list:
results = [sub for res in results for sub in res]
assert all([isinstance(res, praw.models.Submission) or isinstance(res, praw.models.Comment) for res in results])
assert not any([isinstance(m, MagicMock) for m in results])
if result_limit is not None:
assert len(results) == result_limit
return results
def test_determine_directories(tmp_path: Path, downloader_mock: MagicMock):
downloader_mock.args.directory = tmp_path / 'test'
downloader_mock.config_directories.user_config_dir = tmp_path
RedditConnector.determine_directories(downloader_mock)
assert Path(tmp_path / 'test').exists()
@pytest.mark.parametrize(('skip_extensions', 'skip_domains'), (
([], []),
(['.test'], ['test.com'],),
))
def test_create_download_filter(skip_extensions: list[str], skip_domains: list[str], downloader_mock: MagicMock):
downloader_mock.args.skip = skip_extensions
downloader_mock.args.skip_domain = skip_domains
result = RedditConnector.create_download_filter(downloader_mock)
assert isinstance(result, DownloadFilter)
assert result.excluded_domains == skip_domains
assert result.excluded_extensions == skip_extensions
@pytest.mark.parametrize(('test_time', 'expected'), (
('all', 'all'),
('hour', 'hour'),
('day', 'day'),
('week', 'week'),
('random', 'all'),
('', 'all'),
))
def test_create_time_filter(test_time: str, expected: str, downloader_mock: MagicMock):
downloader_mock.args.time = test_time
result = RedditConnector.create_time_filter(downloader_mock)
assert isinstance(result, RedditTypes.TimeType)
assert result.name.lower() == expected
@pytest.mark.parametrize(('test_sort', 'expected'), (
('', 'hot'),
('hot', 'hot'),
('controversial', 'controversial'),
('new', 'new'),
))
def test_create_sort_filter(test_sort: str, expected: str, downloader_mock: MagicMock):
downloader_mock.args.sort = test_sort
result = RedditConnector.create_sort_filter(downloader_mock)
assert isinstance(result, RedditTypes.SortType)
assert result.name.lower() == expected
@pytest.mark.parametrize(('test_file_scheme', 'test_folder_scheme'), (
('{POSTID}', '{SUBREDDIT}'),
('{REDDITOR}_{TITLE}_{POSTID}', '{SUBREDDIT}'),
('{POSTID}', 'test'),
('{POSTID}', ''),
('{POSTID}', '{SUBREDDIT}/{REDDITOR}'),
))
def test_create_file_name_formatter(test_file_scheme: str, test_folder_scheme: str, downloader_mock: MagicMock):
downloader_mock.args.file_scheme = test_file_scheme
downloader_mock.args.folder_scheme = test_folder_scheme
result = RedditConnector.create_file_name_formatter(downloader_mock)
assert isinstance(result, FileNameFormatter)
assert result.file_format_string == test_file_scheme
assert result.directory_format_string == test_folder_scheme.split('/')
@pytest.mark.parametrize(('test_file_scheme', 'test_folder_scheme'), (
('', ''),
('', '{SUBREDDIT}'),
('test', '{SUBREDDIT}'),
))
def test_create_file_name_formatter_bad(test_file_scheme: str, test_folder_scheme: str, downloader_mock: MagicMock):
downloader_mock.args.file_scheme = test_file_scheme
downloader_mock.args.folder_scheme = test_folder_scheme
with pytest.raises(BulkDownloaderException):
RedditConnector.create_file_name_formatter(downloader_mock)
def test_create_authenticator(downloader_mock: MagicMock):
result = RedditConnector.create_authenticator(downloader_mock)
assert isinstance(result, SiteAuthenticator)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_submission_ids', (
('lvpf4l',),
('lvpf4l', 'lvqnsn'),
('lvpf4l', 'lvqnsn', 'lvl9kd'),
))
def test_get_submissions_from_link(
test_submission_ids: list[str],
reddit_instance: praw.Reddit,
downloader_mock: MagicMock):
downloader_mock.args.link = test_submission_ids
downloader_mock.reddit_instance = reddit_instance
results = RedditConnector.get_submissions_from_link(downloader_mock)
assert all([isinstance(sub, praw.models.Submission) for res in results for sub in res])
assert len(results[0]) == len(test_submission_ids)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_subreddits', 'limit', 'sort_type', 'time_filter', 'max_expected_len'), (
(('Futurology',), 10, 'hot', 'all', 10),
(('Futurology', 'Mindustry, Python'), 10, 'hot', 'all', 30),
(('Futurology',), 20, 'hot', 'all', 20),
(('Futurology', 'Python'), 10, 'hot', 'all', 20),
(('Futurology',), 100, 'hot', 'all', 100),
(('Futurology',), 0, 'hot', 'all', 0),
(('Futurology',), 10, 'top', 'all', 10),
(('Futurology',), 10, 'top', 'week', 10),
(('Futurology',), 10, 'hot', 'week', 10),
))
def test_get_subreddit_normal(
test_subreddits: list[str],
limit: int,
sort_type: str,
time_filter: str,
max_expected_len: int,
downloader_mock: MagicMock,
reddit_instance: praw.Reddit,
):
downloader_mock.args.limit = limit
downloader_mock.args.sort = sort_type
downloader_mock.time_filter = RedditConnector.create_time_filter(downloader_mock)
downloader_mock.sort_filter = RedditConnector.create_sort_filter(downloader_mock)
downloader_mock.determine_sort_function.return_value = RedditConnector.determine_sort_function(downloader_mock)
downloader_mock.args.subreddit = test_subreddits
downloader_mock.reddit_instance = reddit_instance
results = RedditConnector.get_subreddits(downloader_mock)
test_subreddits = downloader_mock.split_args_input(test_subreddits)
results = [sub for res1 in results for sub in res1]
assert all([isinstance(res1, praw.models.Submission) for res1 in results])
assert all([res.subreddit.display_name in test_subreddits for res in results])
assert len(results) <= max_expected_len
assert not any([isinstance(m, MagicMock) for m in results])
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_time', 'test_delta'), (
('hour', timedelta(hours=1)),
('day', timedelta(days=1)),
('week', timedelta(days=7)),
('month', timedelta(days=31)),
('year', timedelta(days=365)),
))
def test_get_subreddit_time_verification(
test_time: str,
test_delta: timedelta,
downloader_mock: MagicMock,
reddit_instance: praw.Reddit,
):
downloader_mock.args.limit = 10
downloader_mock.args.sort = 'top'
downloader_mock.args.time = test_time
downloader_mock.time_filter = RedditConnector.create_time_filter(downloader_mock)
downloader_mock.sort_filter = RedditConnector.create_sort_filter(downloader_mock)
downloader_mock.determine_sort_function.return_value = RedditConnector.determine_sort_function(downloader_mock)
downloader_mock.args.subreddit = ['all']
downloader_mock.reddit_instance = reddit_instance
results = RedditConnector.get_subreddits(downloader_mock)
results = [sub for res1 in results for sub in res1]
assert all([isinstance(res1, praw.models.Submission) for res1 in results])
nowtime = datetime.now()
for r in results:
result_time = datetime.fromtimestamp(r.created_utc)
time_diff = nowtime - result_time
assert time_diff < test_delta
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_subreddits', 'search_term', 'limit', 'time_filter', 'max_expected_len'), (
(('Python',), 'scraper', 10, 'all', 10),
(('Python',), '', 10, 'all', 0),
(('Python',), 'djsdsgewef', 10, 'all', 0),
(('Python',), 'scraper', 10, 'year', 10),
))
def test_get_subreddit_search(
test_subreddits: list[str],
search_term: str,
time_filter: str,
limit: int,
max_expected_len: int,
downloader_mock: MagicMock,
reddit_instance: praw.Reddit,
):
downloader_mock._determine_sort_function.return_value = praw.models.Subreddit.hot
downloader_mock.args.limit = limit
downloader_mock.args.search = search_term
downloader_mock.args.subreddit = test_subreddits
downloader_mock.reddit_instance = reddit_instance
downloader_mock.sort_filter = RedditTypes.SortType.HOT
downloader_mock.args.time = time_filter
downloader_mock.time_filter = RedditConnector.create_time_filter(downloader_mock)
results = RedditConnector.get_subreddits(downloader_mock)
results = [sub for res in results for sub in res]
assert all([isinstance(res, praw.models.Submission) for res in results])
assert all([res.subreddit.display_name in test_subreddits for res in results])
assert len(results) <= max_expected_len
if max_expected_len != 0:
assert len(results) > 0
assert not any([isinstance(m, MagicMock) for m in results])
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_user', 'test_multireddits', 'limit'), (
('helen_darten', ('cuteanimalpics',), 10),
('korfor', ('chess',), 100),
))
# Good sources at https://www.reddit.com/r/multihub/
def test_get_multireddits_public(
test_user: str,
test_multireddits: list[str],
limit: int,
reddit_instance: praw.Reddit,
downloader_mock: MagicMock,
):
downloader_mock.determine_sort_function.return_value = praw.models.Subreddit.hot
downloader_mock.sort_filter = RedditTypes.SortType.HOT
downloader_mock.args.limit = limit
downloader_mock.args.multireddit = test_multireddits
downloader_mock.args.user = [test_user]
downloader_mock.reddit_instance = reddit_instance
downloader_mock.create_filtered_listing_generator.return_value = \
RedditConnector.create_filtered_listing_generator(
downloader_mock,
reddit_instance.multireddit(test_user, test_multireddits[0]),
)
results = RedditConnector.get_multireddits(downloader_mock)
results = [sub for res in results for sub in res]
assert all([isinstance(res, praw.models.Submission) for res in results])
assert len(results) == limit
assert not any([isinstance(m, MagicMock) for m in results])
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_user', 'limit'), (
('danigirl3694', 10),
('danigirl3694', 50),
('CapitanHam', None),
))
def test_get_user_submissions(test_user: str, limit: int, downloader_mock: MagicMock, reddit_instance: praw.Reddit):
downloader_mock.args.limit = limit
downloader_mock.determine_sort_function.return_value = praw.models.Subreddit.hot
downloader_mock.sort_filter = RedditTypes.SortType.HOT
downloader_mock.args.submitted = True
downloader_mock.args.user = [test_user]
downloader_mock.authenticated = False
downloader_mock.reddit_instance = reddit_instance
downloader_mock.create_filtered_listing_generator.return_value = \
RedditConnector.create_filtered_listing_generator(
downloader_mock,
reddit_instance.redditor(test_user).submissions,
)
results = RedditConnector.get_user_data(downloader_mock)
results = assert_all_results_are_submissions(limit, results)
assert all([res.author.name == test_user for res in results])
assert not any([isinstance(m, MagicMock) for m in results])
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.authenticated
@pytest.mark.parametrize('test_flag', (
'upvoted',
'saved',
))
def test_get_user_authenticated_lists(
test_flag: str,
downloader_mock: MagicMock,
authenticated_reddit_instance: praw.Reddit,
):
downloader_mock.args.__dict__[test_flag] = True
downloader_mock.reddit_instance = authenticated_reddit_instance
downloader_mock.args.limit = 10
downloader_mock._determine_sort_function.return_value = praw.models.Subreddit.hot
downloader_mock.sort_filter = RedditTypes.SortType.HOT
downloader_mock.args.user = [RedditConnector.resolve_user_name(downloader_mock, 'me')]
results = RedditConnector.get_user_data(downloader_mock)
assert_all_results_are_submissions_or_comments(10, results)
@pytest.mark.parametrize(('test_name', 'expected'), (
('Mindustry', 'Mindustry'),
('Futurology', 'Futurology'),
('r/Mindustry', 'Mindustry'),
('TrollXChromosomes', 'TrollXChromosomes'),
('r/TrollXChromosomes', 'TrollXChromosomes'),
('https://www.reddit.com/r/TrollXChromosomes/', 'TrollXChromosomes'),
('https://www.reddit.com/r/TrollXChromosomes', 'TrollXChromosomes'),
('https://www.reddit.com/r/Futurology/', 'Futurology'),
('https://www.reddit.com/r/Futurology', 'Futurology'),
))
def test_sanitise_subreddit_name(test_name: str, expected: str):
result = RedditConnector.sanitise_subreddit_name(test_name)
assert result == expected
@pytest.mark.parametrize(('test_subreddit_entries', 'expected'), (
(['test1', 'test2', 'test3'], {'test1', 'test2', 'test3'}),
(['test1,test2', 'test3'], {'test1', 'test2', 'test3'}),
(['test1, test2', 'test3'], {'test1', 'test2', 'test3'}),
(['test1; test2', 'test3'], {'test1', 'test2', 'test3'}),
(['test1, test2', 'test1,test2,test3', 'test4'], {'test1', 'test2', 'test3', 'test4'}),
([''], {''}),
(['test'], {'test'}),
))
def test_split_subreddit_entries(test_subreddit_entries: list[str], expected: set[str]):
results = RedditConnector.split_args_input(test_subreddit_entries)
assert results == expected
def test_read_submission_ids_from_file(downloader_mock: MagicMock, tmp_path: Path):
test_file = tmp_path / 'test.txt'
test_file.write_text('aaaaaa\nbbbbbb')
results = RedditConnector.read_id_files([str(test_file)])
assert results == {'aaaaaa', 'bbbbbb'}
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_redditor_name', (
'Paracortex',
'crowdstrike',
'HannibalGoddamnit',
))
def test_check_user_existence_good(
test_redditor_name: str,
reddit_instance: praw.Reddit,
downloader_mock: MagicMock,
):
downloader_mock.reddit_instance = reddit_instance
RedditConnector.check_user_existence(downloader_mock, test_redditor_name)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_redditor_name', (
'lhnhfkuhwreolo',
'adlkfmnhglojh',
))
def test_check_user_existence_nonexistent(
test_redditor_name: str,
reddit_instance: praw.Reddit,
downloader_mock: MagicMock,
):
downloader_mock.reddit_instance = reddit_instance
with pytest.raises(BulkDownloaderException, match='Could not find'):
RedditConnector.check_user_existence(downloader_mock, test_redditor_name)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_redditor_name', (
'Bree-Boo',
))
def test_check_user_existence_banned(
test_redditor_name: str,
reddit_instance: praw.Reddit,
downloader_mock: MagicMock,
):
downloader_mock.reddit_instance = reddit_instance
with pytest.raises(BulkDownloaderException, match='is banned'):
RedditConnector.check_user_existence(downloader_mock, test_redditor_name)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_subreddit_name', 'expected_message'), (
('donaldtrump', 'cannot be found'),
('submitters', 'private and cannot be scraped')
))
def test_check_subreddit_status_bad(test_subreddit_name: str, expected_message: str, reddit_instance: praw.Reddit):
test_subreddit = reddit_instance.subreddit(test_subreddit_name)
with pytest.raises(BulkDownloaderException, match=expected_message):
RedditConnector.check_subreddit_status(test_subreddit)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_subreddit_name', (
'Python',
'Mindustry',
'TrollXChromosomes',
'all',
))
def test_check_subreddit_status_good(test_subreddit_name: str, reddit_instance: praw.Reddit):
test_subreddit = reddit_instance.subreddit(test_subreddit_name)
RedditConnector.check_subreddit_status(test_subreddit)
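Two of the pure helpers this new test module pins down are easy to sketch from their parametrised cases alone. The regexes below are inferred from those cases, not copied from RedditConnector:

```python
import re

def sanitise_subreddit_name(name: str) -> str:
    # Accept 'name', 'r/name', and full reddit.com URLs, with or without a
    # trailing slash; inferred from test_sanitise_subreddit_name.
    match = re.match(r'(?:https://www\.reddit\.com/)?(?:r/)?(.*?)/?$', name)
    return match.group(1)

def split_args_input(entries: list[str]) -> set[str]:
    # Flatten comma- and semicolon-separated entries into a set of names;
    # inferred from test_split_subreddit_entries.
    names = []
    for entry in entries:
        names.extend(re.split(r'[,;]\s*', entry))
    return set(names)

assert sanitise_subreddit_name('https://www.reddit.com/r/Futurology/') == 'Futurology'
assert split_args_input(['test1, test2', 'test3']) == {'test1', 'test2', 'test3'}
```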

View File

@@ -46,7 +46,7 @@ def test_filter_domain(test_url: str, expected: bool, download_filter: DownloadF
('http://reddit.com/test.gif', False),
))
def test_filter_all(test_url: str, expected: bool, download_filter: DownloadFilter):
test_resource = Resource(MagicMock(), test_url)
test_resource = Resource(MagicMock(), test_url, lambda: None)
result = download_filter.check_resource(test_resource)
assert result == expected
@@ -59,6 +59,6 @@ def test_filter_all(test_url: str, expected: bool, download_filter: DownloadFilt
))
def test_filter_empty_filter(test_url: str):
download_filter = DownloadFilter()
test_resource = Resource(MagicMock(), test_url)
test_resource = Resource(MagicMock(), test_url, lambda: None)
result = download_filter.check_resource(test_resource)
assert result is True
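Resource construction grew a third argument, a download callable bound to the resource; the filter only inspects the URL, so a no-op lambda satisfies the new signature in these tests. For instance:

```python
from unittest.mock import MagicMock

from bdfr.download_filter import DownloadFilter
from bdfr.resource import Resource

# An empty filter passes everything; the lambda stands in for the download
# function the Resource now carries.
test_resource = Resource(MagicMock(), 'http://reddit.com/test.mp4', lambda: None)
assert DownloadFilter().check_resource(test_resource) is True
```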

View File

@@ -1,22 +1,18 @@
#!/usr/bin/env python3
# coding=utf-8
import os
import re
from pathlib import Path
from typing import Iterator
from unittest.mock import MagicMock
from unittest.mock import MagicMock, patch
import praw
import praw.models
import pytest
from bdfr.__main__ import setup_logging
from bdfr.configuration import Configuration
from bdfr.download_filter import DownloadFilter
from bdfr.downloader import RedditDownloader, RedditTypes
from bdfr.exceptions import BulkDownloaderException
from bdfr.file_name_formatter import FileNameFormatter
from bdfr.site_authenticator import SiteAuthenticator
from bdfr.connector import RedditConnector
from bdfr.downloader import RedditDownloader
@pytest.fixture()
@@ -30,314 +26,105 @@ def args() -> Configuration:
def downloader_mock(args: Configuration):
downloader_mock = MagicMock()
downloader_mock.args = args
downloader_mock._sanitise_subreddit_name = RedditDownloader._sanitise_subreddit_name
downloader_mock._split_args_input = RedditDownloader._split_args_input
downloader_mock._sanitise_subreddit_name = RedditConnector.sanitise_subreddit_name
downloader_mock._split_args_input = RedditConnector.split_args_input
downloader_mock.master_hash_list = {}
return downloader_mock
def assert_all_results_are_submissions(result_limit: int, results: list[Iterator]):
results = [sub for res in results for sub in res]
assert all([isinstance(res, praw.models.Submission) for res in results])
if result_limit is not None:
assert len(results) == result_limit
return results
def test_determine_directories(tmp_path: Path, downloader_mock: MagicMock):
downloader_mock.args.directory = tmp_path / 'test'
downloader_mock.config_directories.user_config_dir = tmp_path
RedditDownloader._determine_directories(downloader_mock)
assert Path(tmp_path / 'test').exists()
@pytest.mark.parametrize(('skip_extensions', 'skip_domains'), (
([], []),
(['.test'], ['test.com'],),
@pytest.mark.parametrize(('test_ids', 'test_excluded', 'expected_len'), (
(('aaaaaa',), (), 1),
(('aaaaaa',), ('aaaaaa',), 0),
((), ('aaaaaa',), 0),
(('aaaaaa', 'bbbbbb'), ('aaaaaa',), 1),
(('aaaaaa', 'bbbbbb', 'cccccc'), ('aaaaaa',), 2),
))
def test_create_download_filter(skip_extensions: list[str], skip_domains: list[str], downloader_mock: MagicMock):
downloader_mock.args.skip = skip_extensions
downloader_mock.args.skip_domain = skip_domains
result = RedditDownloader._create_download_filter(downloader_mock)
assert isinstance(result, DownloadFilter)
assert result.excluded_domains == skip_domains
assert result.excluded_extensions == skip_extensions
@pytest.mark.parametrize(('test_time', 'expected'), (
('all', 'all'),
('hour', 'hour'),
('day', 'day'),
('week', 'week'),
('random', 'all'),
('', 'all'),
))
def test_create_time_filter(test_time: str, expected: str, downloader_mock: MagicMock):
downloader_mock.args.time = test_time
result = RedditDownloader._create_time_filter(downloader_mock)
assert isinstance(result, RedditTypes.TimeType)
assert result.name.lower() == expected
@pytest.mark.parametrize(('test_sort', 'expected'), (
('', 'hot'),
('hot', 'hot'),
('controversial', 'controversial'),
('new', 'new'),
))
def test_create_sort_filter(test_sort: str, expected: str, downloader_mock: MagicMock):
downloader_mock.args.sort = test_sort
result = RedditDownloader._create_sort_filter(downloader_mock)
assert isinstance(result, RedditTypes.SortType)
assert result.name.lower() == expected
@pytest.mark.parametrize(('test_file_scheme', 'test_folder_scheme'), (
('{POSTID}', '{SUBREDDIT}'),
('{REDDITOR}_{TITLE}_{POSTID}', '{SUBREDDIT}'),
('{POSTID}', 'test'),
('{POSTID}', ''),
('{POSTID}', '{SUBREDDIT}/{REDDITOR}'),
))
def test_create_file_name_formatter(test_file_scheme: str, test_folder_scheme: str, downloader_mock: MagicMock):
downloader_mock.args.file_scheme = test_file_scheme
downloader_mock.args.folder_scheme = test_folder_scheme
result = RedditDownloader._create_file_name_formatter(downloader_mock)
assert isinstance(result, FileNameFormatter)
assert result.file_format_string == test_file_scheme
assert result.directory_format_string == test_folder_scheme.split('/')
@pytest.mark.parametrize(('test_file_scheme', 'test_folder_scheme'), (
('', ''),
('', '{SUBREDDIT}'),
('test', '{SUBREDDIT}'),
))
def test_create_file_name_formatter_bad(test_file_scheme: str, test_folder_scheme: str, downloader_mock: MagicMock):
downloader_mock.args.file_scheme = test_file_scheme
downloader_mock.args.folder_scheme = test_folder_scheme
with pytest.raises(BulkDownloaderException):
RedditDownloader._create_file_name_formatter(downloader_mock)
def test_create_authenticator(downloader_mock: MagicMock):
result = RedditDownloader._create_authenticator(downloader_mock)
assert isinstance(result, SiteAuthenticator)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_submission_ids', (
('lvpf4l',),
('lvpf4l', 'lvqnsn'),
('lvpf4l', 'lvqnsn', 'lvl9kd'),
))
def test_get_submissions_from_link(
test_submission_ids: list[str],
reddit_instance: praw.Reddit,
downloader_mock: MagicMock):
downloader_mock.args.link = test_submission_ids
downloader_mock.reddit_instance = reddit_instance
results = RedditDownloader._get_submissions_from_link(downloader_mock)
assert all([isinstance(sub, praw.models.Submission) for res in results for sub in res])
assert len(results[0]) == len(test_submission_ids)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_subreddits', 'limit', 'sort_type', 'time_filter', 'max_expected_len'), (
(('Futurology',), 10, 'hot', 'all', 10),
(('Futurology', 'Mindustry, Python'), 10, 'hot', 'all', 30),
(('Futurology',), 20, 'hot', 'all', 20),
(('Futurology', 'Python'), 10, 'hot', 'all', 20),
(('Futurology',), 100, 'hot', 'all', 100),
(('Futurology',), 0, 'hot', 'all', 0),
(('Futurology',), 10, 'top', 'all', 10),
(('Futurology',), 10, 'top', 'week', 10),
(('Futurology',), 10, 'hot', 'week', 10),
))
def test_get_subreddit_normal(
test_subreddits: list[str],
limit: int,
sort_type: str,
time_filter: str,
max_expected_len: int,
downloader_mock: MagicMock,
reddit_instance: praw.Reddit,
):
downloader_mock._determine_sort_function.return_value = praw.models.Subreddit.hot
downloader_mock.args.limit = limit
downloader_mock.args.sort = sort_type
downloader_mock.args.subreddit = test_subreddits
downloader_mock.reddit_instance = reddit_instance
downloader_mock.sort_filter = RedditDownloader._create_sort_filter(downloader_mock)
results = RedditDownloader._get_subreddits(downloader_mock)
test_subreddits = downloader_mock._split_args_input(test_subreddits)
results = [sub for res1 in results for sub in res1]
assert all([isinstance(res1, praw.models.Submission) for res1 in results])
assert all([res.subreddit.display_name in test_subreddits for res in results])
assert len(results) <= max_expected_len
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_subreddits', 'search_term', 'limit', 'time_filter', 'max_expected_len'), (
(('Python',), 'scraper', 10, 'all', 10),
(('Python',), '', 10, 'all', 10),
(('Python',), 'djsdsgewef', 10, 'all', 0),
(('Python',), 'scraper', 10, 'year', 10),
(('Python',), 'scraper', 10, 'hour', 1),
))
def test_get_subreddit_search(
test_subreddits: list[str],
search_term: str,
time_filter: str,
limit: int,
max_expected_len: int,
downloader_mock: MagicMock,
reddit_instance: praw.Reddit,
):
downloader_mock._determine_sort_function.return_value = praw.models.Subreddit.hot
downloader_mock.args.limit = limit
downloader_mock.args.search = search_term
downloader_mock.args.subreddit = test_subreddits
downloader_mock.reddit_instance = reddit_instance
downloader_mock.sort_filter = RedditTypes.SortType.HOT
downloader_mock.args.time = time_filter
downloader_mock.time_filter = RedditDownloader._create_time_filter(downloader_mock)
results = RedditDownloader._get_subreddits(downloader_mock)
results = [sub for res in results for sub in res]
assert all([isinstance(res, praw.models.Submission) for res in results])
assert all([res.subreddit.display_name in test_subreddits for res in results])
assert len(results) <= max_expected_len
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_user', 'test_multireddits', 'limit'), (
('helen_darten', ('cuteanimalpics',), 10),
('korfor', ('chess',), 100),
))
# Good sources at https://www.reddit.com/r/multihub/
def test_get_multireddits_public(
test_user: str,
test_multireddits: list[str],
limit: int,
reddit_instance: praw.Reddit,
@patch('bdfr.site_downloaders.download_factory.DownloadFactory.pull_lever')
def test_excluded_ids(
mock_function: MagicMock,
test_ids: tuple[str],
test_excluded: tuple[str],
expected_len: int,
downloader_mock: MagicMock,
):
downloader_mock._determine_sort_function.return_value = praw.models.Subreddit.hot
downloader_mock.sort_filter = RedditTypes.SortType.HOT
downloader_mock.args.limit = limit
downloader_mock.args.multireddit = test_multireddits
downloader_mock.args.user = test_user
downloader_mock.reddit_instance = reddit_instance
downloader_mock._create_filtered_listing_generator.return_value = \
RedditDownloader._create_filtered_listing_generator(
downloader_mock,
reddit_instance.multireddit(test_user, test_multireddits[0]),
)
results = RedditDownloader._get_multireddits(downloader_mock)
results = [sub for res in results for sub in res]
assert all([isinstance(res, praw.models.Submission) for res in results])
assert len(results) == limit
downloader_mock.excluded_submission_ids = test_excluded
mock_function.return_value = MagicMock()
mock_function.return_value.__name__ = 'test'
test_submissions = []
for test_id in test_ids:
m = MagicMock()
m.id = test_id
m.subreddit.display_name.return_value = 'https://www.example.com/'
m.__class__ = praw.models.Submission
test_submissions.append(m)
downloader_mock.reddit_lists = [test_submissions]
for submission in test_submissions:
RedditDownloader._download_submission(downloader_mock, submission)
assert mock_function.call_count == expected_len
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_user', 'limit'), (
('danigirl3694', 10),
('danigirl3694', 50),
('CapitanHam', None),
@pytest.mark.parametrize('test_submission_id', (
'm1hqw6',
))
def test_get_user_submissions(test_user: str, limit: int, downloader_mock: MagicMock, reddit_instance: praw.Reddit):
downloader_mock.args.limit = limit
downloader_mock._determine_sort_function.return_value = praw.models.Subreddit.hot
downloader_mock.sort_filter = RedditTypes.SortType.HOT
downloader_mock.args.submitted = True
downloader_mock.args.user = test_user
downloader_mock.authenticated = False
downloader_mock.reddit_instance = reddit_instance
downloader_mock._create_filtered_listing_generator.return_value = \
RedditDownloader._create_filtered_listing_generator(
downloader_mock,
reddit_instance.redditor(test_user).submissions,
)
results = RedditDownloader._get_user_data(downloader_mock)
results = assert_all_results_are_submissions(limit, results)
assert all([res.author.name == test_user for res in results])
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.authenticated
@pytest.mark.parametrize('test_flag', (
'upvoted',
'saved',
))
def test_get_user_authenticated_lists(
test_flag: str,
downloader_mock: MagicMock,
authenticated_reddit_instance: praw.Reddit,
):
downloader_mock.args.__dict__[test_flag] = True
downloader_mock.reddit_instance = authenticated_reddit_instance
downloader_mock.args.user = 'me'
downloader_mock.args.limit = 10
downloader_mock._determine_sort_function.return_value = praw.models.Subreddit.hot
downloader_mock.sort_filter = RedditTypes.SortType.HOT
RedditDownloader._resolve_user_name(downloader_mock)
results = RedditDownloader._get_user_data(downloader_mock)
assert_all_results_are_submissions(10, results)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_submission_id', 'expected_files_len'), (
('ljyy27', 4),
))
def test_download_submission(
def test_mark_hard_link(
test_submission_id: str,
expected_files_len: int,
downloader_mock: MagicMock,
reddit_instance: praw.Reddit,
tmp_path: Path):
tmp_path: Path,
reddit_instance: praw.Reddit
):
downloader_mock.reddit_instance = reddit_instance
downloader_mock.download_filter.check_url.return_value = True
downloader_mock.args.folder_scheme = ''
downloader_mock.file_name_formatter = RedditDownloader._create_file_name_formatter(downloader_mock)
downloader_mock.args.make_hard_links = True
downloader_mock.download_directory = tmp_path
downloader_mock.args.folder_scheme = ''
downloader_mock.args.file_scheme = '{POSTID}'
downloader_mock.file_name_formatter = RedditConnector.create_file_name_formatter(downloader_mock)
submission = downloader_mock.reddit_instance.submission(id=test_submission_id)
original = Path(tmp_path, f'{test_submission_id}.png')
RedditDownloader._download_submission(downloader_mock, submission)
folder_contents = list(tmp_path.iterdir())
assert len(folder_contents) == expected_files_len
assert original.exists()
downloader_mock.args.file_scheme = 'test2_{POSTID}'
downloader_mock.file_name_formatter = RedditConnector.create_file_name_formatter(downloader_mock)
RedditDownloader._download_submission(downloader_mock, submission)
test_file_1_stats = original.stat()
test_file_2_inode = Path(tmp_path, f'test2_{test_submission_id}.png').stat().st_ino
assert test_file_1_stats.st_nlink == 2
assert test_file_1_stats.st_ino == test_file_2_inode
@pytest.mark.online
@pytest.mark.reddit
def test_download_submission_file_exists(
@pytest.mark.parametrize(('test_submission_id', 'test_creation_date'), (
('ndzz50', 1621204841.0),
))
def test_file_creation_date(
test_submission_id: str,
test_creation_date: float,
downloader_mock: MagicMock,
reddit_instance: praw.Reddit,
tmp_path: Path,
capsys: pytest.CaptureFixture
reddit_instance: praw.Reddit
):
setup_logging(3)
downloader_mock.reddit_instance = reddit_instance
downloader_mock.download_filter.check_url.return_value = True
downloader_mock.args.folder_scheme = ''
downloader_mock.file_name_formatter = RedditDownloader._create_file_name_formatter(downloader_mock)
downloader_mock.download_directory = tmp_path
submission = downloader_mock.reddit_instance.submission(id='m1hqw6')
Path(tmp_path, 'Arneeman_Metagaming isn\'t always a bad thing_m1hqw6.png').touch()
downloader_mock.args.folder_scheme = ''
downloader_mock.args.file_scheme = '{POSTID}'
downloader_mock.file_name_formatter = RedditConnector.create_file_name_formatter(downloader_mock)
submission = downloader_mock.reddit_instance.submission(id=test_submission_id)
RedditDownloader._download_submission(downloader_mock, submission)
folder_contents = list(tmp_path.iterdir())
output = capsys.readouterr()
assert len(folder_contents) == 1
assert 'Arneeman_Metagaming isn\'t always a bad thing_m1hqw6.png already exists' in output.out
for file_path in Path(tmp_path).iterdir():
file_stats = os.stat(file_path)
assert file_stats.st_mtime == test_creation_date
def test_search_existing_files():
results = RedditDownloader.scan_existing_files(Path('.'))
assert len(results.keys()) != 0
@pytest.mark.online
@@ -358,7 +145,7 @@ def test_download_submission_hash_exists(
downloader_mock.download_filter.check_url.return_value = True
downloader_mock.args.folder_scheme = ''
downloader_mock.args.no_dupes = True
downloader_mock.file_name_formatter = RedditDownloader._create_file_name_formatter(downloader_mock)
downloader_mock.file_name_formatter = RedditConnector.create_file_name_formatter(downloader_mock)
downloader_mock.download_directory = tmp_path
downloader_mock.master_hash_list = {test_hash: None}
submission = downloader_mock.reddit_instance.submission(id=test_submission_id)
@@ -369,165 +156,47 @@ def test_download_submission_hash_exists(
assert re.search(r'Resource hash .*? downloaded elsewhere', output.out)
@pytest.mark.parametrize(('test_name', 'expected'), (
('Mindustry', 'Mindustry'),
('Futurology', 'Futurology'),
('r/Mindustry', 'Mindustry'),
('TrollXChromosomes', 'TrollXChromosomes'),
('r/TrollXChromosomes', 'TrollXChromosomes'),
('https://www.reddit.com/r/TrollXChromosomes/', 'TrollXChromosomes'),
('https://www.reddit.com/r/TrollXChromosomes', 'TrollXChromosomes'),
('https://www.reddit.com/r/Futurology/', 'Futurology'),
('https://www.reddit.com/r/Futurology', 'Futurology'),
))
def test_sanitise_subreddit_name(test_name: str, expected: str):
result = RedditDownloader._sanitise_subreddit_name(test_name)
assert result == expected
def test_search_existing_files():
results = RedditDownloader.scan_existing_files(Path('.'))
assert len(results.keys()) >= 40
@pytest.mark.parametrize(('test_subreddit_entries', 'expected'), (
(['test1', 'test2', 'test3'], {'test1', 'test2', 'test3'}),
(['test1,test2', 'test3'], {'test1', 'test2', 'test3'}),
(['test1, test2', 'test3'], {'test1', 'test2', 'test3'}),
(['test1; test2', 'test3'], {'test1', 'test2', 'test3'}),
(['test1, test2', 'test1,test2,test3', 'test4'], {'test1', 'test2', 'test3', 'test4'})
))
def test_split_subreddit_entries(test_subreddit_entries: list[str], expected: set[str]):
results = RedditDownloader._split_args_input(test_subreddit_entries)
assert results == expected
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_submission_id', (
'm1hqw6',
))
def test_mark_hard_link(
test_submission_id: str,
def test_download_submission_file_exists(
downloader_mock: MagicMock,
reddit_instance: praw.Reddit,
tmp_path: Path,
reddit_instance: praw.Reddit
capsys: pytest.CaptureFixture
):
setup_logging(3)
downloader_mock.reddit_instance = reddit_instance
downloader_mock.args.make_hard_links = True
downloader_mock.download_directory = tmp_path
downloader_mock.download_filter.check_url.return_value = True
downloader_mock.args.folder_scheme = ''
downloader_mock.args.file_scheme = '{POSTID}'
downloader_mock.file_name_formatter = RedditDownloader._create_file_name_formatter(downloader_mock)
downloader_mock.file_name_formatter = RedditConnector.create_file_name_formatter(downloader_mock)
downloader_mock.download_directory = tmp_path
submission = downloader_mock.reddit_instance.submission(id='m1hqw6')
Path(tmp_path, 'Arneeman_Metagaming isn\'t always a bad thing_m1hqw6.png').touch()
RedditDownloader._download_submission(downloader_mock, submission)
folder_contents = list(tmp_path.iterdir())
output = capsys.readouterr()
assert len(folder_contents) == 1
assert 'Arneeman_Metagaming isn\'t always a bad thing_m1hqw6.png'\
' from submission m1hqw6 already exists' in output.out
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_submission_id', 'expected_files_len'), (
('ljyy27', 4),
))
def test_download_submission(
test_submission_id: str,
expected_files_len: int,
downloader_mock: MagicMock,
reddit_instance: praw.Reddit,
tmp_path: Path):
downloader_mock.reddit_instance = reddit_instance
downloader_mock.download_filter.check_url.return_value = True
downloader_mock.args.folder_scheme = ''
downloader_mock.file_name_formatter = RedditConnector.create_file_name_formatter(downloader_mock)
downloader_mock.download_directory = tmp_path
submission = downloader_mock.reddit_instance.submission(id=test_submission_id)
original = Path(tmp_path, f'{test_submission_id}.png')
RedditDownloader._download_submission(downloader_mock, submission)
assert original.exists()
downloader_mock.args.file_scheme = 'test2_{POSTID}'
downloader_mock.file_name_formatter = RedditDownloader._create_file_name_formatter(downloader_mock)
RedditDownloader._download_submission(downloader_mock, submission)
test_file_1_stats = original.stat()
test_file_2_inode = Path(tmp_path, f'test2_{test_submission_id}.png').stat().st_ino
assert test_file_1_stats.st_nlink == 2
assert test_file_1_stats.st_ino == test_file_2_inode
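The inode assertions above rest on ordinary hard-link semantics, which can be shown without any BDFR code (file names here are illustrative):

```python
import os
from pathlib import Path

first = Path('m1hqw6.png')
first.write_bytes(b'example')
second = Path('test2_m1hqw6.png')
os.link(first, second)  # both names now point at the same inode

assert first.stat().st_ino == second.stat().st_ino
assert first.stat().st_nlink == 2  # two directory entries, one file
```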
@pytest.mark.parametrize(('test_ids', 'test_excluded', 'expected_len'), (
(('aaaaaa',), (), 1),
(('aaaaaa',), ('aaaaaa',), 0),
((), ('aaaaaa',), 0),
(('aaaaaa', 'bbbbbb'), ('aaaaaa',), 1),
))
def test_excluded_ids(test_ids: tuple[str], test_excluded: tuple[str], expected_len: int, downloader_mock: MagicMock):
downloader_mock.excluded_submission_ids = test_excluded
test_submissions = []
for test_id in test_ids:
m = MagicMock()
m.id = test_id
test_submissions.append(m)
downloader_mock.reddit_lists = [test_submissions]
RedditDownloader.download(downloader_mock)
assert downloader_mock._download_submission.call_count == expected_len
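The call-count assertion implies that the download loop drops excluded IDs before dispatching anything. A minimal sketch of that behaviour, with `download_all` standing in for the real `RedditDownloader.download`:

```python
def download_all(downloader) -> None:
    # Mirror of the behaviour the mock asserts: excluded IDs are
    # filtered out before _download_submission is ever called.
    for submission_list in downloader.reddit_lists:
        for submission in submission_list:
            if submission.id in downloader.excluded_submission_ids:
                continue
            downloader._download_submission(submission)
```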
def test_read_excluded_submission_ids_from_file(downloader_mock: MagicMock, tmp_path: Path):
test_file = tmp_path / 'test.txt'
test_file.write_text('aaaaaa\nbbbbbb')
downloader_mock.args.exclude_id_file = [test_file]
results = RedditDownloader._read_excluded_ids(downloader_mock)
assert results == {'aaaaaa', 'bbbbbb'}
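Reading the exclusion files reduces to splitting lines and stripping whitespace; a sketch over the same inputs (the real `_read_excluded_ids` may differ):

```python
from pathlib import Path

def read_excluded_ids(id_files: list[Path]) -> set[str]:
    # Union the non-empty, stripped lines of every exclusion file.
    excluded = set()
    for id_file in id_files:
        excluded |= {line.strip() for line in id_file.read_text().splitlines() if line.strip()}
    return excluded
```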
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_redditor_name', (
'Paracortex',
'crowdstrike',
'HannibalGoddamnit',
))
def test_check_user_existence_good(
test_redditor_name: str,
reddit_instance: praw.Reddit,
downloader_mock: MagicMock,
):
downloader_mock.reddit_instance = reddit_instance
RedditDownloader._check_user_existence(downloader_mock, test_redditor_name)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_redditor_name', (
'lhnhfkuhwreolo',
'adlkfmnhglojh',
))
def test_check_user_existence_nonexistent(
test_redditor_name: str,
reddit_instance: praw.Reddit,
downloader_mock: MagicMock,
):
downloader_mock.reddit_instance = reddit_instance
with pytest.raises(BulkDownloaderException, match='Could not find'):
RedditDownloader._check_user_existence(downloader_mock, test_redditor_name)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_redditor_name', (
'Bree-Boo',
))
def test_check_user_existence_banned(
test_redditor_name: str,
reddit_instance: praw.Reddit,
downloader_mock: MagicMock,
):
downloader_mock.reddit_instance = reddit_instance
with pytest.raises(BulkDownloaderException, match='is banned'):
RedditDownloader._check_user_existence(downloader_mock, test_redditor_name)
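Taken together, the three user-existence tests suggest logic roughly like the following. With PRAW, a lazy `Redditor` only hits the API on first attribute access, unknown names raise `NotFound`, and suspended accounts expose `is_suspended` but no `id`. The error strings come from the `match=` patterns above; the control flow is an assumption:

```python
import praw
import prawcore

class BulkDownloaderException(Exception):
    pass

def check_user_existence(reddit: praw.Reddit, name: str) -> None:
    user = reddit.redditor(name)
    try:
        if user.id:  # first attribute access triggers the API request
            return
    except prawcore.exceptions.NotFound:
        raise BulkDownloaderException(f'Could not find user {name}')
    except AttributeError:
        # Suspended accounts come back without an id attribute.
        if hasattr(user, 'is_suspended'):
            raise BulkDownloaderException(f'User {name} is banned')
```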
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_subreddit_name', 'expected_message'), (
('donaldtrump', 'cannot be found'),
('submitters', 'private and cannot be scraped')
))
def test_check_subreddit_status_bad(test_subreddit_name: str, expected_message: str, reddit_instance: praw.Reddit):
test_subreddit = reddit_instance.subreddit(test_subreddit_name)
with pytest.raises(BulkDownloaderException, match=expected_message):
RedditDownloader._check_subreddit_status(test_subreddit)
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize('test_subreddit_name', (
'Python',
'Mindustry',
'TrollXChromosomes',
'all',
))
def test_check_subreddit_status_good(test_subreddit_name: str, reddit_instance: praw.Reddit):
test_subreddit = reddit_instance.subreddit(test_subreddit_name)
RedditDownloader._check_subreddit_status(test_subreddit)
folder_contents = list(tmp_path.iterdir())
assert len(folder_contents) == expected_files_len
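The two subreddit-status tests above pin down the failure messages. A plausible shape for `_check_subreddit_status`, reusing `BulkDownloaderException` from the sketch above:

```python
import praw.models
import prawcore

def check_subreddit_status(subreddit: praw.models.Subreddit) -> None:
    if subreddit.display_name == 'all':
        return  # r/all is virtual and has no ID to fetch
    try:
        assert subreddit.id is not None  # forces the lazy object to fetch
    except prawcore.exceptions.NotFound:
        raise BulkDownloaderException(f'Source {subreddit.display_name} cannot be found')
    except prawcore.exceptions.Forbidden:
        raise BulkDownloaderException(f'Source {subreddit.display_name} is private and cannot be scraped')
```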

View File

@@ -1,17 +1,21 @@
#!/usr/bin/env python3
# coding=utf-8
import platform
import sys
import unittest.mock
from datetime import datetime
from pathlib import Path
from typing import Optional, Union
from unittest.mock import MagicMock
import platform
import praw.models
import pytest
from bdfr.file_name_formatter import FileNameFormatter
from bdfr.resource import Resource
from bdfr.site_downloaders.base_downloader import BaseDownloader
from bdfr.site_downloaders.fallback_downloaders.ytdlp_fallback import YtdlpFallback
@pytest.fixture()
@@ -28,10 +32,10 @@ def submission() -> MagicMock:
return test
def do_test_string_equality(result: str, expected: str) -> bool:
def do_test_string_equality(result: Union[Path, str], expected: str) -> bool:
if platform.system() == 'Windows':
expected = FileNameFormatter._format_for_windows(expected)
return expected == result
return str(result).endswith(expected)
def do_test_path_equality(result: Path, expected: str) -> bool:
@@ -41,7 +45,7 @@ def do_test_path_equality(result: Path, expected: str) -> bool:
expected = Path(*expected)
else:
expected = Path(expected)
return result == expected
return str(result).endswith(str(expected))
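Both equality helpers now compare only the tail of the generated path, which keeps assertions independent of pytest's per-run `tmp_path` prefix. For example:

```python
from pathlib import Path

result = Path('/tmp/pytest-of-user/pytest-0/test0/test/example_1.png')
expected = Path('test', 'example_1.png')
assert str(result).endswith(str(expected))  # passes regardless of the tmp prefix
```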
@pytest.fixture(scope='session')
@@ -118,7 +122,7 @@ def test_format_full(
format_string_file: str,
expected: str,
reddit_submission: praw.models.Submission):
test_resource = Resource(reddit_submission, 'i.reddit.com/blabla.png')
test_resource = Resource(reddit_submission, 'i.reddit.com/blabla.png', lambda: None)
test_formatter = FileNameFormatter(format_string_file, format_string_directory, 'ISO')
result = test_formatter.format_path(test_resource, Path('test'))
assert do_test_path_equality(result, expected)
@@ -135,7 +139,7 @@ def test_format_full_conform(
format_string_directory: str,
format_string_file: str,
reddit_submission: praw.models.Submission):
test_resource = Resource(reddit_submission, 'i.reddit.com/blabla.png')
test_resource = Resource(reddit_submission, 'i.reddit.com/blabla.png', lambda: None)
test_formatter = FileNameFormatter(format_string_file, format_string_directory, 'ISO')
test_formatter.format_path(test_resource, Path('test'))
@@ -155,7 +159,7 @@ def test_format_full_with_index_suffix(
expected: str,
reddit_submission: praw.models.Submission,
):
test_resource = Resource(reddit_submission, 'i.reddit.com/blabla.png')
test_resource = Resource(reddit_submission, 'i.reddit.com/blabla.png', lambda: None)
test_formatter = FileNameFormatter(format_string_file, format_string_directory, 'ISO')
result = test_formatter.format_path(test_resource, Path('test'), index)
assert do_test_path_equality(result, expected)
@@ -172,8 +176,9 @@ def test_format_multiple_resources():
mocks.append(new_mock)
test_formatter = FileNameFormatter('{TITLE}', '', 'ISO')
results = test_formatter.format_resource_paths(mocks, Path('.'))
results = set([str(res[0]) for res in results])
assert results == {'test_1.png', 'test_2.png', 'test_3.png', 'test_4.png'}
results = set([str(res[0].name) for res in results])
expected = {'test_1.png', 'test_2.png', 'test_3.png', 'test_4.png'}
assert results == expected
@pytest.mark.parametrize(('test_filename', 'test_ending'), (
@@ -183,10 +188,11 @@ def test_format_multiple_resources():
('😍💕✨' * 100, '_1.png'),
))
def test_limit_filename_length(test_filename: str, test_ending: str):
result = FileNameFormatter._limit_file_name_length(test_filename, test_ending)
assert len(result) <= 255
assert len(result.encode('utf-8')) <= 255
assert isinstance(result, str)
result = FileNameFormatter.limit_file_name_length(test_filename, test_ending, Path('.'))
assert len(result.name) <= 255
assert len(result.name.encode('utf-8')) <= 255
assert len(str(result)) <= FileNameFormatter.find_max_path_length()
assert isinstance(result, Path)
@pytest.mark.parametrize(('test_filename', 'test_ending', 'expected_end'), (
@@ -201,25 +207,41 @@ def test_limit_filename_length(test_filename: str, test_ending: str):
('😍💕✨' * 100 + '_aaa1aa', '_1.png', '_aaa1aa_1.png'),
))
def test_preserve_id_append_when_shortening(test_filename: str, test_ending: str, expected_end: str):
result = FileNameFormatter._limit_file_name_length(test_filename, test_ending)
assert len(result) <= 255
assert len(result.encode('utf-8')) <= 255
assert isinstance(result, str)
assert result.endswith(expected_end)
result = FileNameFormatter.limit_file_name_length(test_filename, test_ending, Path('.'))
assert len(result.name) <= 255
assert len(result.name.encode('utf-8')) <= 255
assert result.name.endswith(expected_end)
assert len(str(result)) <= FileNameFormatter.find_max_path_length()
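The assertions above pin down the contract: the result is now a `Path`, the file name fits in 255 bytes, the whole path respects the platform maximum, and a trailing ID suffix survives truncation. A sketch meeting that contract, assuming a `find_max_path_length` helper like the one tested further down:

```python
import re
from pathlib import Path

def limit_file_name_length(filename: str, ending: str, root: Path) -> Path:
    # Peel a trailing "_<6 chars>" ID off the stem so truncation cannot eat it.
    match = re.search(r'(_\w{6})$', filename)
    if match:
        ending = match.group(1) + ending
        filename = filename[:match.start()]
    # Drop characters until both the file name and the full path fit.
    # find_max_path_length() is assumed; see the sketch near test_get_max_path_length.
    while (len((filename + ending).encode('utf-8')) > 255
            or len(str(Path(root, filename + ending))) > find_max_path_length()):
        filename = filename[:-1]
    return Path(root, filename + ending)
```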
def test_shorten_filenames(submission: MagicMock, tmp_path: Path):
submission.title = 'A' * 300
@pytest.mark.skipif(sys.platform == 'win32', reason='Test broken on windows github')
def test_shorten_filename_real(submission: MagicMock, tmp_path: Path):
submission.title = 'A' * 500
submission.author.name = 'test'
submission.subreddit.display_name = 'test'
submission.id = 'BBBBBB'
test_resource = Resource(submission, 'www.example.com/empty', '.jpeg')
test_resource = Resource(submission, 'www.example.com/empty', lambda: None, '.jpeg')
test_formatter = FileNameFormatter('{REDDITOR}_{TITLE}_{POSTID}', '{SUBREDDIT}', 'ISO')
result = test_formatter.format_path(test_resource, tmp_path)
result.parent.mkdir(parents=True)
result.touch()
@pytest.mark.parametrize(('test_name', 'test_ending'), (
('a', 'b'),
('a', '_bbbbbb.jpg'),
('a' * 20, '_bbbbbb.jpg'),
('a' * 50, '_bbbbbb.jpg'),
('a' * 500, '_bbbbbb.jpg'),
))
def test_shorten_path(test_name: str, test_ending: str, tmp_path: Path):
result = FileNameFormatter.limit_file_name_length(test_name, test_ending, tmp_path)
assert len(str(result.name)) <= 255
assert len(str(result.name).encode('UTF-8')) <= 255
assert len(str(result.name).encode('cp1252')) <= 255
assert len(str(result)) <= FileNameFormatter.find_max_path_length()
@pytest.mark.parametrize(('test_string', 'expected'), (
('test', 'test'),
('test😍', 'test'),
@@ -293,9 +315,9 @@ def test_format_archive_entry_comment(
):
test_comment = reddit_instance.comment(id=test_comment_id)
test_formatter = FileNameFormatter(test_file_scheme, test_folder_scheme, 'ISO')
test_entry = Resource(test_comment, '', '.json')
test_entry = Resource(test_comment, '', lambda: None, '.json')
result = test_formatter.format_path(test_entry, tmp_path)
assert do_test_string_equality(result.name, expected_name)
assert do_test_string_equality(result, expected_name)
@pytest.mark.parametrize(('test_folder_scheme', 'expected'), (
@@ -364,3 +386,36 @@ def test_time_string_formats(test_time_format: str, expected: str):
test_formatter = FileNameFormatter('{TITLE}', '', test_time_format)
result = test_formatter._convert_timestamp(test_time.timestamp())
assert result == expected
def test_get_max_path_length():
result = FileNameFormatter.find_max_path_length()
assert result in (4096, 260, 1024)
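The three accepted values correspond to Windows' MAX_PATH (260), the common Linux PATH_MAX (4096), and macOS (1024). One plausible implementation consistent with this test queries `getconf` and falls back to known defaults:

```python
import platform
import subprocess

def find_max_path_length() -> int:
    try:
        # POSIX systems can report their own limit.
        return int(subprocess.check_output(['getconf', 'PATH_MAX', '/']))
    except (ValueError, OSError, subprocess.CalledProcessError):
        # Fall back to the Windows MAX_PATH or the common POSIX default.
        return 260 if platform.system() == 'Windows' else 4096
```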
def test_windows_max_path(tmp_path: Path):
with unittest.mock.patch('platform.system', return_value='Windows'):
with unittest.mock.patch('bdfr.file_name_formatter.FileNameFormatter.find_max_path_length', return_value=260):
result = FileNameFormatter.limit_file_name_length('test' * 100, '_1.png', tmp_path)
assert len(str(result)) <= 260
assert len(result.name) <= (260 - len(str(tmp_path)))
@pytest.mark.online
@pytest.mark.reddit
@pytest.mark.parametrize(('test_reddit_id', 'test_downloader', 'expected_names'), (
('gphmnr', YtdlpFallback, {'He has a lot to say today.mp4'}),
('d0oir2', YtdlpFallback, {"Crunk's finest moment. Welcome to the new subreddit!.mp4"}),
))
def test_name_submission(
test_reddit_id: str,
test_downloader: type[BaseDownloader],
expected_names: set[str],
reddit_instance: praw.reddit.Reddit,
):
test_submission = reddit_instance.submission(id=test_reddit_id)
test_resources = test_downloader(test_submission).find_resources()
test_formatter = FileNameFormatter('{TITLE}', '', '')
results = test_formatter.format_resource_paths(test_resources, Path('.'))
results = set([r[0].name for r in results])
assert expected_names == results

View File

@@ -21,7 +21,7 @@ from bdfr.resource import Resource
('https://www.test.com/test/test2/example.png?random=test#thing', '.png'),
))
def test_resource_get_extension(test_url: str, expected: str):
test_resource = Resource(MagicMock(), test_url)
test_resource = Resource(MagicMock(), test_url, lambda: None)
result = test_resource._determine_extension()
assert result == expected
@@ -31,6 +31,6 @@ def test_resource_get_extension(test_url: str, expected: str):
('https://www.iana.org/_img/2013.1/iana-logo-header.svg', '426b3ac01d3584c820f3b7f5985d6623'),
))
def test_download_online_resource(test_url: str, expected_hash: str):
test_resource = Resource(MagicMock(), test_url)
test_resource.download(120)
test_resource = Resource(MagicMock(), test_url, Resource.retry_download(test_url))
test_resource.download()
assert test_resource.hash.hexdigest() == expected_hash
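The updated `Resource` constructor takes a zero-argument download callable; `Resource.retry_download(url)` presumably builds one with retry behaviour. A standalone sketch of that shape, verified against the URL and MD5 hash from the test above (the backoff details are assumptions):

```python
import hashlib
import time

import requests

def retry_download(url: str, max_wait: int = 300):
    # Build a zero-argument callable that fetches the URL, backing off on failure.
    def download() -> bytes:
        wait = 30
        while True:
            try:
                response = requests.get(url, timeout=60)
                if response.ok:
                    return response.content
            except requests.RequestException:
                pass  # fall through and retry after the wait
            if wait >= max_wait:
                raise RuntimeError(f'Could not download {url}')
            time.sleep(wait)
            wait *= 2
    return download

content = retry_download('https://www.iana.org/_img/2013.1/iana-logo-header.svg')()
assert hashlib.md5(content).hexdigest() == '426b3ac01d3584c820f3b7f5985d6623'
```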