Crawler

class novelsave_sources.sources.Crawler(http_gateway: Optional[novelsave_sources.utils.gateways.BaseHttpGateway] = None)

Base crawler class

Implements crawler-specific helper methods that can be used when parsing HTML content.

lang

The language of the content available through the source. It is set to multi if the source supports multiple languages.

Type

str

base_urls

The hostnames of the websites that this crawler supports.

Type

List[str]

last_updated

The date at which the specific crawler implementation was last updated.

Type

datetime.date

bad_tags

List of names of tags that should be removed from chapter content for this specific crawler.

Type

List[str]

blacklist_patterns

List of regex patterns denoting text that should be removed from chapter content.

Type

List[str]

notext_tags

List of names of tags that should not be removed from chapter content even if they contain no text.

Elements with no text are usually removed from the chapter content, unless the element is specified in this list.

Type

List[str]

preserve_attrs

Element attributes that contain meaningful content and should be kept within the element during attribute cleanup.

Type

List[str]

clean_contents(contents)

Remove unnecessary elements and attributes

clean_element(element)

If the element does not add any meaningful content, it is removed. This happens under any of the following conditions:

  • Element is a comment

  • Element is a <br> and the next sibling element is also a <br>

  • Element is one of the bad tags (undesired tags that don't add content)

  • Element has no text, has no children, and is not listed in notext_tags (elements that don't need text to be meaningful)

  • The text of the element matches one of the blacklisted patterns (undesirable text such as ads and watermarks)

If none of the conditions are met, all attributes except those listed in preserve_attrs are removed from the element.
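The removal conditions above can be sketched as a single predicate. This is only an illustration under stated assumptions: Node is a simplified stand-in for a parsed element (not a bs4 type), should_remove is a hypothetical helper, and matching patterns with re.search is an assumption about how blacklisting is applied.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Simplified stand-in for a parsed HTML element (hypothetical, not a bs4 type)."""
    name: str
    text: str = ""
    is_comment: bool = False
    children: List["Node"] = field(default_factory=list)
    next_sibling: Optional["Node"] = None


def should_remove(node: Node, bad_tags: List[str], notext_tags: List[str],
                  blacklist_patterns: List[str]) -> bool:
    """Return True if the node meets any of the documented removal conditions."""
    if node.is_comment:
        return True
    # A <br> immediately followed by another <br> is redundant.
    if node.name == "br" and node.next_sibling is not None and node.next_sibling.name == "br":
        return True
    if node.name in bad_tags:
        return True
    # Empty leaf elements are dropped unless explicitly allowed to be empty.
    if not node.text.strip() and not node.children and node.name not in notext_tags:
        return True
    # Assumption: a pattern matching anywhere in the text blacklists the node.
    return any(re.search(pattern, node.text) for pattern in blacklist_patterns)
```

With notext_tags containing "br", a lone <br> survives, while the first of two consecutive <br> elements is dropped.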

static find_paragraphs(element, **kwargs) → List[str]

Extract all text of the element into paragraphs

get_soup(url: str, method: str = 'GET', **kwargs) → bs4.BeautifulSoup

Makes a request to the url and attempts to make a BeautifulSoup object from the response content.

Once the response is acquired, a soup object is created using make_soup(). The soup object is then checked for a body element to verify that the document was retrieved successfully.

Parameters
  • url (str) – The url to request

  • method (str) – The request method (default = 'GET')

Returns

The created soup object

Return type

BeautifulSoup

Raises

ConnectionError – If document was not retrieved successfully

init()

Use this method instead of overriding __init__ for trivial setup.

The purpose can be any of:

  • editing bad_tags or blacklist_patterns
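A minimal sketch of this pattern, assuming the base constructor invokes init() once its own setup is done (that call order is an assumption, not confirmed here). BaseCrawler is a stand-in for Crawler, and ExampleCrawler, the default tag list, and the pattern are hypothetical.

```python
from typing import List


class BaseCrawler:
    """Stand-in for Crawler; assumes the constructor calls init() after setup."""

    def __init__(self) -> None:
        self.bad_tags: List[str] = ["script", "style"]  # hypothetical defaults
        self.blacklist_patterns: List[str] = []
        self.init()

    def init(self) -> None:
        """Hook for subclasses; does nothing by default."""


class ExampleCrawler(BaseCrawler):
    def init(self) -> None:
        # Extend the defaults without re-implementing __init__.
        self.bad_tags += ["aside"]
        self.blacklist_patterns += [r"^Sponsored"]
```

Keeping the customization in init() means the subclass does not need to repeat the constructor signature or forward http_gateway.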

is_blacklisted(text)

Whether the text is blacklisted
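As a rough illustration, a check like this could test the text against each entry of blacklist_patterns. The patterns below are hypothetical, and using a case-insensitive re.search is an assumption; the actual matching semantics may differ.

```python
import re
from typing import Sequence

# Hypothetical patterns; a real crawler defines its own blacklist_patterns.
BLACKLIST_PATTERNS = (
    r"^\s*Translator:",
    r"^\s*Read (the )?latest chapters? at .+$",
)


def is_blacklisted(text: str, patterns: Sequence[str] = BLACKLIST_PATTERNS) -> bool:
    """Return True if any pattern matches the text (assumed semantics)."""
    return any(re.search(pattern, text, flags=re.IGNORECASE) for pattern in patterns)
```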

static make_soup(text: Union[str, bytes], parser: str = 'lxml') → bs4.BeautifulSoup

Create a new soup object using the specified parser

Parameters
  • text (str | bytes) – The content for the soup

  • parser (str) – The html tree parser to use (default = ‘lxml’)

Returns

The created soup object

Return type

BeautifulSoup

classmethod of(url: str) → bool

Check whether the url is from this source

The source implementations may override this method to provide custom matching functionality.

The default implementation checks if the hostname of the url matches any of the base urls of the source.

Parameters

url (str) – The url to test if it belongs to this source

Returns

Whether the url is from this source

Return type

bool
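The default hostname comparison might look roughly like the sketch below. matches_source is a hypothetical free function standing in for the classmethod, and it assumes each entry in base_urls is a full url including the scheme.

```python
from typing import List
from urllib.parse import urlparse


def matches_source(url: str, base_urls: List[str]) -> bool:
    """Compare the url's hostname with the hostname of each base url."""
    hostname = urlparse(url).hostname
    if hostname is None:
        # Not a parsable absolute url, so it cannot belong to any source.
        return False
    return any(hostname == urlparse(base).hostname for base in base_urls)
```

A source that serves content from several subdomains would be a natural candidate for overriding this check with custom matching.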

request(method: str, url: str, **kwargs) → requests.models.Response

Send a request to the provided url using the specified method

Checks whether the response is valid before returning it; if it is not valid, an exception is raised.

Parameters
  • method (str) – Request method, e.g. GET, POST, PUT

  • url (str) – The url endpoint to make the request to

  • kwargs – Forwarded to http_gateway.request

Returns

The response from the request

Return type

requests.Response

Raises

BadResponseException – if the response is not valid (status code != 200)
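The validate-then-return flow can be sketched without touching the network. Everything here is a stand-in: FakeResponse and FakeGateway are hypothetical test doubles, BadResponseException is redefined locally so the block is self-contained, and request is written as a free function rather than a method.

```python
from dataclasses import dataclass
from typing import Dict


class BadResponseException(Exception):
    """Local stand-in for the library's exception type."""


@dataclass
class FakeResponse:
    status_code: int
    text: str = ""


class FakeGateway:
    """Hypothetical gateway returning canned responses instead of real requests."""

    def __init__(self, responses: Dict[str, FakeResponse]) -> None:
        self._responses = responses

    def request(self, method: str, url: str, **kwargs) -> FakeResponse:
        # Unknown urls behave like a missing page.
        return self._responses.get(url, FakeResponse(status_code=404))


def request(gateway: FakeGateway, method: str, url: str, **kwargs) -> FakeResponse:
    """Forward the call to the gateway and reject any non-200 response."""
    response = gateway.request(method, url, **kwargs)
    if response.status_code != 200:
        raise BadResponseException(f"{method} {url} returned {response.status_code}")
    return response
```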

to_absolute_url(url: str, current_url: Optional[str] = None) → str

Detects the url state and converts it into the appropriate absolute url

There are several relevant states the url could be in:

  • absolute: starts with either ‘https://’ or ‘http://’; in this case the url is returned as is, without any changes.

  • missing scheme: the scheme is missing and the url starts with ‘//’; in this case the appropriate scheme from either the current url or the base url is prefixed.

  • relative absolute: the url is relative to the website and starts with ‘/’; in this case the base website location (netloc) is prefixed to the url.

  • relative current: the url is relative to the current webpage and does not match any of the above conditions; in this case the url is appended to the current url provided.

Parameters
  • url (str) – The url to be converted

  • current_url (Optional[str]) – The webpage from which the url is extracted

Returns

The absolute converted url

Return type

str
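The four states can be illustrated with a standalone function. This is a sketch under assumptions: base_url is passed explicitly here (the method presumably takes it from the crawler), and urllib.parse.urljoin is swapped in for the "relative current" case, so results may differ from the library when current_url lacks a trailing slash.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def to_absolute_url(url: str, base_url: str, current_url: Optional[str] = None) -> str:
    """Convert url to an absolute url based on which state it is in."""
    reference = current_url or base_url
    if url.startswith(("https://", "http://")):
        return url  # absolute: returned unchanged
    if url.startswith("//"):
        # missing scheme: borrow it from the current url or the base url
        return f"{urlparse(reference).scheme}:{url}"
    if url.startswith("/"):
        # relative absolute: prefix scheme and netloc of the website
        parsed = urlparse(base_url)
        return f"{parsed.scheme}://{parsed.netloc}{url}"
    # relative current: resolve against the page the url was found on
    return urljoin(reference, url)
```

For instance, resolving "chapter-2" against a current_url of "https://example.com/novel/1/" gives "https://example.com/novel/1/chapter-2" in this sketch.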