API

Crawler

class novelsave_sources.sources.Crawler(http_gateway: Optional[novelsave_sources.utils.gateways.BaseHttpGateway] = None)[source]

Base crawler class

Implements crawler specific helper methods that can be used when parsing html content

lang

The language of the content available through the source. It is set to multi if the source supports multiple languages.

Type

str

base_urls

The hostnames of the websites that this crawler supports.

Type

List[str]

last_updated

The date at which the specific crawler implementation was last updated.

Type

datetime.date

bad_tags

List of names of tags that should be removed from chapter content for this specific crawler.

Type

List[str]

blacklist_patterns

List of regex patterns denoting text that should be removed from chapter content.

Type

List[str]

notext_tags

List of names of tags that should not be removed from chapter content even when they contain no text.

Elements with no text are usually removed from the chapter content, unless the element is specified in this list.

Type

List[str]

preserve_attrs

Element attributes that contain meaningful content and should be kept within the element during attribute cleanup.

Type

List[str]

clean_contents(contents)[source]

Remove unnecessary elements and attributes

clean_element(element)[source]

If the element does not add any meaningful content, it is removed. This happens when any of the conditions below is met:

  • Element is a comment

  • Element is a <br> and the next sibling element is also a <br>

  • Element is part of the bad tags (undesired tags that don't add content)

  • The element has no text and has no children and is not part of notext_tags (elements that don't need text to be meaningful)

  • The text of the element matches one of the blacklisted patterns (undesirable text such as ads and watermarks)

If none of the conditions are met, all attributes except those listed in preserve_attrs are removed from this element.

static find_paragraphs(element, **kwargs) -> List[str][source]

Extract all text of the element into paragraphs

get_soup(url: str, method: str = 'GET', **kwargs) -> bs4.BeautifulSoup[source]

Makes a request to the url and attempts to make a BeautifulSoup object from the response content.

Once the response is acquired, a soup object is created using make_soup(). The soup object is then checked for a body element to verify that the document was retrieved successfully.

Parameters
Returns

The created soup object

Return type

BeautifulSoup

Raises

ConnectionError – If document was not retrieved successfully

init()[source]

Call this method instead of __init__ for trivial purposes

The purpose can be any of:

  • editing bad_tags or blacklist_patterns

is_blacklisted(text)[source]

Whether the text is blacklisted
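
A minimal sketch of how such a check could work, assuming each entry of blacklist_patterns is a regular expression tested against the stripped text with re.search. The matching mode actually used by the crawler (search, match, or fullmatch) is not stated here, and the patterns below are invented for illustration.

```python
import re
from typing import List

# Hypothetical patterns; real crawlers define their own blacklist_patterns.
blacklist_patterns: List[str] = [
    r"^Read the latest chapters at .+$",
    r"^Translator:.*$",
]


def is_blacklisted(text: str, patterns: List[str] = blacklist_patterns) -> bool:
    """Return True if the stripped text matches any blacklisted pattern (sketch)."""
    stripped = text.strip()
    return any(
        re.search(pattern, stripped, flags=re.IGNORECASE) for pattern in patterns
    )
```

Text for which this returns True would be dropped by clean_element under the last condition listed for that method.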

static make_soup(text: Union[str, bytes], parser: str = 'lxml') -> bs4.BeautifulSoup[source]

Create a new soup object using the specified parser

Parameters
  • text (str | bytes) – The content for the soup

  • parser (str) – The html tree parser to use (default = ‘lxml’)

Returns

The created soup object

Return type

BeautifulSoup

classmethod of(url: str) -> bool[source]

Check whether the url is from this source

The source implementations may override this method to provide custom matching functionality.

The default implementation checks if the hostname of the url matches any of the base urls of the source.

Parameters

url (str) – The url to test if it belongs to this source

Returns

Whether the url is from this source

Return type

bool
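
The default matching described above might look roughly like the sketch below. ExampleSource is a stand-in class, not part of the package, and the format of base_urls (full urls including the scheme) is an assumption made for this example.

```python
from typing import List
from urllib.parse import urlparse


class ExampleSource:
    """Stand-in class for illustration; not part of novelsave_sources."""

    # Assumed format: full base urls including the scheme.
    base_urls: List[str] = ["https://www.example.com", "https://example.org"]

    @classmethod
    def of(cls, url: str) -> bool:
        """Check whether the hostname of url matches the hostname of any base url."""
        hostname = urlparse(url).hostname
        if hostname is None:
            # Not a parsable absolute url, so it cannot belong to this source.
            return False
        return any(hostname == urlparse(base).hostname for base in cls.base_urls)
```

A source with unusual url patterns could override of() with its own test, as noted above.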

request(method: str, url: str, **kwargs) -> requests.models.Response[source]

Send a request to the provided url using the specified method

Checks whether the response is valid before returning it. If it is not valid, an exception is thrown.

Parameters
  • method (str) – Request method ex: GET, POST, PUT

  • url (str) – The url endpoint to make the request to

  • kwargs – Forwarded to http_gateway.request

Returns

The response from the request

Return type

requests.Response

Raises

BadResponseException – if the response is not valid (status code != 200)
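
The validity check can be pictured with the sketch below. Both classes are stand-ins defined only for this example (the real method works with requests.Response and the package's own BadResponseException), and ensure_valid is a hypothetical helper name.

```python
from dataclasses import dataclass


class BadResponseException(Exception):
    """Stand-in for novelsave_sources.BadResponseException."""


@dataclass
class FakeResponse:
    """Stand-in for requests.Response with only the fields used here."""

    status_code: int
    text: str = ""


def ensure_valid(response: FakeResponse) -> FakeResponse:
    """Return the response unchanged, or raise if the status code is not 200."""
    if response.status_code != 200:
        raise BadResponseException(f"unexpected status code: {response.status_code}")
    return response
```

Callers that want to tolerate failed requests would wrap the call in a try block and catch BadResponseException.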

to_absolute_url(url: str, current_url: Optional[str] = None) -> str[source]

Detects the url state and converts it into the appropriate absolute url

There are several relevant states the url could be in:

  • absolute: starts with either ‘https://’ or ‘http://’, in this case the url is returned as is, without any changes.

  • missing scheme: the scheme is missing and the url starts with ‘//’, in this case the appropriate scheme from either the current url or the base url is prefixed.

  • relative absolute: the url is relative to the website and starts with ‘/’, in this case the base website location (netloc) is prefixed to the url.

  • relative current: the url is relative to the current webpage and does not match any of the above conditions, in this case the url is added to the current url provided.

Parameters
  • url (str) – The url to be converted

  • current_url (Optional[str]) – The webpage from which the url is extracted

Returns

The absolute converted url

Return type

str
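
The four states can be sketched with the standard library as follows. This follows the description above literally and is not the library's code; BASE_URL is a hypothetical base url, and the last case uses plain concatenation, which may differ from RFC 3986 style resolution such as urllib.parse.urljoin.

```python
from typing import Optional
from urllib.parse import urlparse

BASE_URL = "https://example.com"  # hypothetical base url of a source


def to_absolute_url(url: str, current_url: Optional[str] = None) -> str:
    """Convert url to an absolute url using current_url, else BASE_URL (sketch)."""
    reference = urlparse(current_url or BASE_URL)

    if url.startswith(("https://", "http://")):
        # absolute: returned unchanged
        return url
    if url.startswith("//"):
        # missing scheme: borrow the scheme of the reference url
        return f"{reference.scheme}:{url}"
    if url.startswith("/"):
        # relative to the website: prefix scheme and netloc
        return f"{reference.scheme}://{reference.netloc}{url}"
    # relative to the current page: append to the current url
    base = (current_url or BASE_URL).rstrip("/")
    return f"{base}/{url}"
```

For example, under these assumptions to_absolute_url("/novel/1") gives "https://example.com/novel/1".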

Sources

Sources are divided into two groups:

  • Novel

    Interface to be implemented by primary novel content scrapers

  • MetaData

    Interface to be implemented by supplementary metadata scrapers

Novel source interface

class novelsave_sources.Source(*args, **kwargs)[source]

Bases: novelsave_sources.sources.crawler.Crawler

Novel source interface

All novel sources must implement this interface

name

Alternative name for the source. If it is not set, use the class name from the Source.__name__ magic attribute instead.

For example:

name = getattr(Source, 'name', Source.__name__)
Type

Optional[str]

login_viable

Specifies if the source has login functionality implemented.

Type

bool

search_viable

Specifies if the source has the ability to search for novels implemented.

Type

bool

__init__(*args, **kwargs)[source]

When initializing the source,

  • The source is checked for cookie domains. If there are none, they are built from the base_urls.

abstract chapter(chapter: novelsave_sources.models.chapter.Chapter)[source]

Download and parse chapter content

The typical implementation of this method retrieves the chapter's reading content and updates the paragraphs attribute of the provided chapter. It does not return any result.

In rare instances, other attributes of the Chapter, such as title, are also updated.

Parameters

chapter (Chapter) – Chapter object with at least the url attribute filled.

login(email: str, password: str)[source]

Login to the source and assign the required cookies

Unlike novel and chapter, login is not marked abstract, but it does not have a working implementation either. By default, it throws an UnavailableException.

You may specify whether login is implemented using login_viable.

Parameters
  • email (str) – Email or username credentials

  • password (str) – password credentials
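
The pattern of a non-abstract method that is unavailable by default can be sketched as below. All classes here are stand-ins written for this example, and try_login is a hypothetical helper showing how a caller might consult login_viable first.

```python
from abc import ABC, abstractmethod


class UnavailableException(Exception):
    """Stand-in for novelsave_sources.UnavailableException."""


class BaseSourceSketch(ABC):
    """Stand-in mirroring the shape of the Source interface."""

    login_viable = False

    @abstractmethod
    def novel(self, url: str):
        """Subclasses must implement this."""

    def login(self, email: str, password: str):
        # Not abstract, but unavailable unless a subclass overrides it.
        raise UnavailableException(f"{type(self).__name__} does not support login")


class NoLoginSource(BaseSourceSketch):
    def novel(self, url: str):
        return {"url": url}


class LoginSource(NoLoginSource):
    login_viable = True

    def login(self, email: str, password: str):
        # A real source would authenticate and assign cookies here.
        self.credentials = (email, password)


def try_login(source: BaseSourceSketch, email: str, password: str) -> bool:
    """Attempt login only when the source advertises support for it."""
    if not source.login_viable:
        return False
    source.login(email, password)
    return True
```

The same arrangement applies to search and search_viable.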

abstract novel(url: str) -> novelsave_sources.models.novel.Novel[source]

Download and parse novel information

The typical implementation of this method is straightforward: it downloads and parses the profile page into a novel object. Usually the table of contents is part of that page; in other instances, additional downloads may be required.

Parameters

url (str) – The url pointing towards the main profile page

Returns

Novel object that contains the parsed data

Return type

Novel

search(keyword: str, *args, **kwargs) -> List[novelsave_sources.models.novel.Novel][source]

Search for a novel on the source

Unlike novel and chapter, search is not marked abstract, but it does not have a working implementation either. By default, it throws an UnavailableException.

You may specify whether search is implemented using search_viable.

Parameters

keyword (str) – The query text to be used in the search. Usually part of title.

Returns

The resulting novels from the search

Return type

List[Novel]

MetaData source interface

class novelsave_sources.MetaSource(*args, **kwargs)[source]

Bases: novelsave_sources.sources.crawler.Crawler

MetaData source interface

All metadata sources must implement this interface.

abstract retrieve(url: str) -> List[novelsave_sources.models.metadata.Metadata][source]

Retrieves metadata from url

An implementation might retrieve the metadata by requesting from an api endpoint or from scraping a website.

Parameters

url (str) – Url pointing to the metadata

Returns

List of metadata retrieved for the page.

Return type

List[Metadata]

Gateways

Http Gateway

class novelsave_sources.utils.gateways.BaseHttpGateway[source]

Base gateway interface that defines http communication

abstract property cookies: requests.cookies.RequestsCookieJar

Get current cookies being used in session

The setter for this property must also be implemented.

Returns

The cookies in the session

Return type

RequestsCookieJar

get(*args, **kwargs)[source]

Aliased method to send GET request using request() method

post(*args, **kwargs)[source]

Aliased method to send POST request using request() method

abstract request(method: str, url: str, headers: Optional[dict] = None, params: Optional[dict] = None, data: Optional[dict] = None, json: Optional[dict] = None) -> requests.models.Response[source]

Send an http request to the specified url using the specified options

Parameters
  • method (str) – The method of request to send. ex: GET, POST, PUT

  • url (str) – The endpoint to which the request is to be made

  • headers (dict) – The headers to be sent with the request. If not specified, the default headers from the requests module are sent.

  • params (dict) – The query parameters to be sent with the request.

  • data (dict) – ‘x-www-form-urlencoded’ data to be sent with the request.

  • json (dict) – json to be sent in the request body.

Returns

The response resulting from the request

Return type

requests.Response
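
How get() and post() relate to request() can be illustrated with a stand-in interface. GatewaySketch and RecordingGateway below are hypothetical classes written for this example; they omit the abstract cookies property and use loose types so that the sketch runs without the requests package.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Tuple


class GatewaySketch(ABC):
    """Stand-in mirroring the shape of BaseHttpGateway; not the real class."""

    @abstractmethod
    def request(self, method: str, url: str, **kwargs: Any) -> Any:
        """Send a request; concrete gateways decide how."""

    def get(self, *args: Any, **kwargs: Any) -> Any:
        # Alias: forwards to request() with the method fixed to GET.
        return self.request("GET", *args, **kwargs)

    def post(self, *args: Any, **kwargs: Any) -> Any:
        # Alias: forwards to request() with the method fixed to POST.
        return self.request("POST", *args, **kwargs)


class RecordingGateway(GatewaySketch):
    """Hypothetical gateway that records calls instead of using the network."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str, Dict[str, Any]]] = []

    def request(self, method: str, url: str, **kwargs: Any) -> str:
        self.calls.append((method, url, kwargs))
        return f"{method} {url}"
```

A gateway of this kind can be handy in tests, since only request() has to be implemented for get() and post() to work.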

Default http gateway

class novelsave_sources.utils.gateways.DefaultHttpGateway[source]

Default Http gateway implementation used by sources

This implementation has the following properties:

  • Uses cloudscraper package, which detects Cloudflare’s anti-bot pages.

    self.session = cloudscraper.create_scraper(ssl_context=ctx)
    
  • Disables SSL certificate verification, as verification seems to break most sites.

    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    
    self.session = ... # initialize scraper session
    
    self.session.verify = False
    

    As a result, it also disables InsecureRequestWarning in the request context.

    with warnings.catch_warnings():
        warnings.simplefilter("ignore", InsecureRequestWarning)
        # logic
    

Models

Novel

class novelsave_sources.Novel(title: str, url: str, author: typing.Optional[str] = None, synopsis: typing.List[str] = <factory>, thumbnail_url: typing.Optional[str] = None, status: typing.Optional[str] = None, lang: str = 'en', volumes: typing.List[novelsave_sources.models.volume.Volume] = <factory>, metadata: typing.List[novelsave_sources.models.metadata.Metadata] = <factory>)[source]

Data class for parsed novels

title

The name of the novel.

Type

str

url

The url pointing to the webpage of novel.

Type

str

author

The author of the novel.

Type

Optional[str]

synopsis

The description of the novel in lines or paragraphs.

Type

List[str]

thumbnail_url

The url pointing to the thumbnail image of the novel

Type

Optional[str]

status

The status of the novel, can be ongoing, completed, or hiatus

Type

Optional[str]

lang

The language of the novel. This is not the original language, but the language in which the novel is currently readable.

Type

str

volumes

List of volumes of the novel

Type

List[Volume]

metadata

List of metadata of the novel

Type

List[Metadata]

add_metadata(*args, **kwargs)[source]

Shorthand for adding metadata

get_default_volume()[source]

Get or create the default volume for the novel

If the novel already has volumes, this method returns the first volume, otherwise creates and adds the default volume to novel and returns that volume.
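
The get-or-create behaviour can be sketched with stand-in data classes. VolumeSketch and NovelSketch are simplified and hypothetical, and the values returned by default() are placeholders, since the real defaults are not documented here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VolumeSketch:
    index: int
    name: str
    chapters: List[str] = field(default_factory=list)

    @staticmethod
    def default() -> "VolumeSketch":
        # Placeholder values; the real defaults are an unknown here.
        return VolumeSketch(index=-1, name="default")


@dataclass
class NovelSketch:
    title: str
    url: str
    volumes: List[VolumeSketch] = field(default_factory=list)

    def get_default_volume(self) -> VolumeSketch:
        """Return the first volume, creating and adding a default one if needed."""
        if self.volumes:
            return self.volumes[0]
        volume = VolumeSketch.default()
        self.volumes.append(volume)
        return volume
```

Calling get_default_volume() repeatedly on the same novel returns the same volume object and never adds a second default.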

Volume

class novelsave_sources.Volume(index: int, name: str, chapters: typing.List[novelsave_sources.models.chapter.Chapter] = <factory>)[source]

Data class that identifies a single volume in a novel

index

The order of volume in the novel. Lowest first.

Type

int

name

The name of the volume.

Type

str

chapters

The chapters belonging to this volume of the novel.

Type

List[Chapter]

add(chapter: novelsave_sources.models.chapter.Chapter)[source]

Shorthand method to add chapter into this volume

static default()[source]

Factory method that returns volume object with values identifying it as default.

This method is used when a particular source does not define any volumes for the novel

Chapter

class novelsave_sources.Chapter(index: int = -1, title: Optional[str] = None, paragraphs: Optional[str] = None, url: Optional[str] = None, updated: Optional[datetime.datetime] = None)[source]

Data class that identifies a single chapter in a novel

index

The order of chapter in the novel. Lowest first.

Type

int

title

The title of the chapter.

Type

str

paragraphs

The reading content of the chapter in html.

Type

Optional[str]

url

The url pointing to the chapter on the web.

Type

str

updated

The time this chapter was last updated as defined by the source.

Type

Optional[datetime.datetime]

Metadata

class novelsave_sources.Metadata(name: str, value: str, others: Optional[dict] = None)[source]

Data class that holds a single value of metadata for novels

name

Name of the metadata

Example: subject, tag

Type

str

value

Value of the metadata

Type

str

others

A dictionary value defining other attributes of the metadata

Type

dict

namespace

The namespace of the metadata. This is either Dublin Core (DC) or OPF.

Dublin Core (DC) has the following tags:

title, language, subject, creator, contributor, publisher, rights, coverage, date, description

The __init__() method automatically identifies the namespace.

Type

str

__init__(name: str, value: str, others: Optional[dict] = None)[source]

The namespace attribute is calculated by checking if the name exists in the Dublin Core tags. If so, namespace is set to Dublin Core (DC), otherwise it is set to OPF.

Refer to name, value, and others for more details on parameters.
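
The namespace rule can be sketched as below. MetadataSketch is a stand-in data class, and the string labels "DC" and "OPF" are assumptions for the two namespaces; the tag set is the Dublin Core list given for the namespace attribute.

```python
from dataclasses import dataclass, field
from typing import Optional

DUBLIN_CORE_TAGS = {
    "title", "language", "subject", "creator", "contributor",
    "publisher", "rights", "coverage", "date", "description",
}


@dataclass
class MetadataSketch:
    name: str
    value: str
    others: Optional[dict] = None
    namespace: str = field(init=False)  # computed, not passed by the caller

    def __post_init__(self) -> None:
        # "DC" and "OPF" are assumed labels for the two namespaces.
        self.namespace = "DC" if self.name in DUBLIN_CORE_TAGS else "OPF"
```

Under these assumptions a subject entry falls in the Dublin Core namespace, while a custom name such as tag falls back to OPF.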

Utilities

This package provides two sets of utility functions, one for each source type.

Note that the following functions return source types, not instances; instantiating the source is left to you.

This gives you the opportunity to inject your own http gateway and override the default behaviour. Check out the gateways api section for more information.

Retrieve all novel sources

novelsave_sources.novel_source_types() -> List[Type[novelsave_sources.sources.novel.source.Source]][source]

Return all the available novel source types

The first usage may be slow as it searches for all the source implementations and caches the results.

Returns

All the novel source scraper implementations

Return type

List[Type[Source]]

Find the novel source that can parse a specific url

novelsave_sources.locate_novel_source(url: str) -> Type[novelsave_sources.sources.novel.source.Source][source]

Locate and return the novel source parser for the url if it is supported

Parameters

url (str) – Url pointing to the novel or the chapter needing to be scraped.

Returns

Specific novel scraper that supports the url provided.

Return type

Type[Source]

Raises

UnknownSourceException – if the url cannot be parsed by any existing source schema
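
The lookup can be pictured with the sketch below: cache the list of source types, return the first type whose of() accepts the url, and let the caller instantiate it. Every name here is a stand-in defined for the example (HostSource, AlphaSource, BetaSource, source_types, locate_source); the real package discovers its implementations rather than listing them by hand.

```python
from functools import lru_cache
from typing import Tuple, Type
from urllib.parse import urlparse


class UnknownSourceException(Exception):
    """Stand-in for novelsave_sources.UnknownSourceException."""


class HostSource:
    """Stand-in source type with hostname-based matching."""

    hostnames: Tuple[str, ...] = ()

    def __init__(self, http_gateway=None):
        self.http_gateway = http_gateway

    @classmethod
    def of(cls, url: str) -> bool:
        return urlparse(url).hostname in cls.hostnames


class AlphaSource(HostSource):
    hostnames = ("alpha.example.com",)


class BetaSource(HostSource):
    hostnames = ("beta.example.com",)


@lru_cache(maxsize=None)
def source_types() -> Tuple[Type[HostSource], ...]:
    # Computed once and cached, so only the first call pays the cost.
    return (AlphaSource, BetaSource)


def locate_source(url: str) -> Type[HostSource]:
    """Return the first source type whose of() accepts the url."""
    for source_type in source_types():
        if source_type.of(url):
            return source_type
    raise UnknownSourceException(f"no source found for {url}")
```

Because a type is returned rather than an instance, the caller chooses the constructor arguments, for example locate_source(url)(http_gateway=my_gateway) with a gateway of their own.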

Retrieve all metadata sources

novelsave_sources.metadata_source_types() -> List[Type[novelsave_sources.sources.metadata.metasource.MetaSource]][source]

Locate and return all the metadata source types

The first usage may be slow as it searches for all the source implementations and caches the results.

Returns

All the metadata source scraper implementations

Return type

List[Type[MetaSource]]

Find the metadata source that can parse a specific url

novelsave_sources.locate_metadata_source(url: str) -> Type[novelsave_sources.sources.metadata.metasource.MetaSource][source]

Locate and return the metadata source parser for the url if it is supported

Parameters

url (str) – Url pointing to the metadata profile.

Returns

Specific metadata scraper that supports the url provided.

Return type

Type[MetaSource]

Raises

UnknownSourceException – if the url cannot be parsed by any existing source schema

Exceptions

exception novelsave_sources.SourcesException[source]

Base exception of this package, from which all other exceptions are derived.

exception novelsave_sources.BadResponseException[source]

Thrown when an unexpected response is received

exception novelsave_sources.UnknownSourceException[source]

Thrown when the url does not correspond to an existing source

exception novelsave_sources.UnavailableException[source]

Thrown when a function is unavailable

exception novelsave_sources.ChapterException[source]

Thrown when something unexpected happens during chapter update