API
Crawler
- class novelsave_sources.sources.Crawler(http_gateway: Optional[novelsave_sources.utils.gateways.BaseHttpGateway] = None)[source]
Base crawler class
Implements crawler specific helper methods that can be used when parsing html content
- lang
The language of the content available through the source. It is specified
multi if the source supports multiple languages.
- Type
str
- base_urls
The hostnames of the websites that this crawler supports.
- Type
List[str]
- last_updated
The date at which the specific crawler implementation was last updated.
- Type
datetime.date
- bad_tags
List of names of tags that should be removed from chapter content for this specific crawler.
- Type
List[str]
- blacklist_patterns
List of regex patterns denoting text that should be removed from chapter content.
- Type
List[str]
- notext_tags
List of names of tags that should not be removed from chapter content even when they contain no text.
Elements with no text are usually removed from the chapter content, unless the element is specified in this list.
- Type
List[str]
- preserve_attrs
Element attributes that contain meaningful content and should be kept within the element during attribute cleanup.
- Type
List[str]
- clean_element(element)[source]
Removes the element if it does not add any meaningful content. This happens when any of the conditions below is met:
Element is a comment
Element is a <br> and the next sibling element is also a <br>
Element is one of the bad tags (undesired tags that don't add content)
Element has no text, has no children, and is not in notext_tags (elements that don't need text to be meaningful)
The text of the element matches one of the blacklisted patterns (undesirable text such as ads and watermarks)
If none of the conditions are met, all attributes except those listed in
preserve_attrs are removed from this element.
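The removal conditions above can be sketched as a single predicate. This is only an illustration: `Node` is a hypothetical stand-in for a parsed HTML element, and `should_remove`, the constant names, and the example values are assumptions rather than the library's implementation.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed HTML element."""
    name: str                    # tag name, e.g. "p" or "br"
    text: str = ""               # text content of the element
    is_comment: bool = False     # True for an HTML comment node
    has_children: bool = False   # True if the element has child elements
    next_sibling_name: str = ""  # tag name of the next sibling, if any


# Example values only; each crawler defines its own lists.
BAD_TAGS = ["script", "style"]
NOTEXT_TAGS = ["img", "hr", "br"]
BLACKLIST_PATTERNS = [r"^Read more at"]


def should_remove(node: Node, bad_tags: List[str], notext_tags: List[str],
                  blacklist_patterns: List[str]) -> bool:
    """Return True if any of the documented removal conditions holds."""
    if node.is_comment:
        return True
    if node.name == "br" and node.next_sibling_name == "br":
        return True
    if node.name in bad_tags:
        return True
    text = node.text.strip()
    if not text and not node.has_children and node.name not in notext_tags:
        return True
    # Assumption: a pattern "matches" when it matches at the start of the text.
    return any(re.match(pattern, text) for pattern in blacklist_patterns)
```

An element for which the predicate is false would then go through attribute cleanup, keeping only the attributes named in preserve_attrs.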
- static find_paragraphs(element, **kwargs) List[str][source]
Extract all text of the element into paragraphs
- get_soup(url: str, method: str = 'GET', **kwargs) bs4.BeautifulSoup[source]
Makes a request to the url and attempts to make a
BeautifulSoup object from the response content.
Once the response is acquired, the soup object is created using
make_soup(). The soup object is then checked for a body element to verify that the document was retrieved successfully.
- init()[source]
Use this method instead of __init__ for trivial purposes, such as:
editing bad_tags or blacklist_patterns
- static make_soup(text: Union[str, bytes], parser: str = 'lxml') bs4.BeautifulSoup[source]
Create a new soup object using the specified parser
- Parameters
text (str | bytes) – The content for the soup
parser (str) – The html tree parser to use (default = ‘lxml’)
- Returns
The created soup object
- Return type
BeautifulSoup
- classmethod of(url: str) bool[source]
Check whether the url is from this source
The source implementations may override this method to provide custom matching functionality.
The default implementation checks if the hostname of the url matches any of the base urls of the source.
- Parameters
url (str) – The url to test if it belongs to this source
- Returns
Whether the url is from this source
- Return type
bool
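The default matching described above could look roughly like the following sketch. `ExampleSource`, `hostname_of`, and the example urls are hypothetical; the sketch also assumes that entries in base_urls may be either bare hostnames or full urls, which the text does not confirm.

```python
from typing import List
from urllib.parse import urlparse


def hostname_of(value: str) -> str:
    """Extract a lowercase hostname from a full url or a bare hostname."""
    # Assumption: a value without "//" is a bare hostname such as "example.com".
    parsed = urlparse(value if "//" in value else "//" + value)
    return (parsed.hostname or "").lower()


class ExampleSource:
    """Hypothetical source; only the pieces needed for url matching."""
    base_urls: List[str] = ["https://www.example.com"]

    @classmethod
    def of(cls, url: str) -> bool:
        # True when the url's hostname equals the hostname of any base url.
        target = hostname_of(url)
        return any(hostname_of(base) == target for base in cls.base_urls)
```

A source that needs looser matching, for instance to accept several subdomains, would override `of` with its own rule.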
- request(method: str, url: str, **kwargs) requests.models.Response[source]
Send a request to the provided url using the specified method
Checks if the response is valid before returning; if it is not valid, an exception is raised.
- Parameters
method (str) – Request method, e.g. GET, POST, PUT
url (str) – The url endpoint to make the request to
kwargs – Forwarded to http_gateway.request
- Returns
The response from the request
- Return type
requests.Response
- Raises
BadResponseException – if the response is not valid (status code != 200)
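A minimal sketch of the validation step, assuming only what is stated above (a response is invalid when its status code is not 200). `FakeResponse`, `FakeGateway`, and the local `BadResponseException` are hypothetical stand-ins so that the example runs without network access or the package itself.

```python
from dataclasses import dataclass
from typing import Dict


class BadResponseException(Exception):
    """Local stand-in for the package's exception of the same name."""


@dataclass
class FakeResponse:
    """Hypothetical minimal response carrying only a status code."""
    status_code: int


class FakeGateway:
    """Hypothetical gateway that answers from a fixed table, not the network."""

    def __init__(self, statuses: Dict[str, int]):
        self.statuses = statuses

    def request(self, method: str, url: str, **kwargs) -> FakeResponse:
        # Unknown urls are answered with 404.
        return FakeResponse(self.statuses.get(url, 404))


def request(gateway: FakeGateway, method: str, url: str, **kwargs) -> FakeResponse:
    """Forward the call to the gateway and reject any non-200 response."""
    response = gateway.request(method, url, **kwargs)
    if response.status_code != 200:
        raise BadResponseException(f"{method} {url} returned {response.status_code}")
    return response
```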
- to_absolute_url(url: str, current_url: Optional[str] = None) str[source]
Detects the url state and converts it into the appropriate absolute url
There are several relevant states the url could be in:
absolute: starts with either ‘https://’ or ‘http://’; in this case the url is returned as is, without any changes.
missing schema: the schema is missing and the url starts with ‘//’; in this case the appropriate schema from either the current url or the base url is prefixed.
relative absolute: the url is relative to the website and starts with ‘/’; in this case the base website location (netloc) is prefixed to the url.
relative current: the url is relative to the current webpage and does not match any of the above conditions; in this case the url is added to the current url provided.
- Parameters
url (str) – The url to be converted
current_url (Optional[str]) – The webpage from which the url is extracted
- Returns
The absolute converted url
- Return type
str
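The four url states can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the method's actual code: `BASE_URL` is a hypothetical base url, the current url is preferred over the base url whenever it is given, and "added to the current url" is read as `urljoin` resolution.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.example.com"  # hypothetical base url of a source


def to_absolute_url(url: str, current_url: Optional[str] = None) -> str:
    """Convert a url in any of the four documented states to an absolute url."""
    # Assumption: use the current url as reference when given, else the base url.
    reference = urlparse(current_url or BASE_URL)
    if url.startswith(("https://", "http://")):
        return url  # absolute: returned unchanged
    if url.startswith("//"):
        return f"{reference.scheme}:{url}"  # missing schema: prefix the schema
    if url.startswith("/"):
        # relative absolute: prefix schema and netloc
        return f"{reference.scheme}://{reference.netloc}{url}"
    # relative current: resolve against the current page (urljoin semantics)
    return urljoin(current_url or BASE_URL, url)
```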
Sources
Sources are divided into two groups:
- Novel
Interface to be implemented by primary novel content scrapers
- MetaData
Interface to be implemented by supplementary metadata scrapers
Novel source interface
- class novelsave_sources.Source(*args, **kwargs)[source]
Bases:
novelsave_sources.sources.crawler.Crawler
Novel source interface
All novel sources must implement this interface
- name
Alternative name for the source. If it is not set, use the class name from the
Source.__name__ magic attribute.
For example:
name = getattr(Source, 'name', Source.__name__)
- Type
Optional[str]
- login_viable
Specifies if the source has login functionality implemented.
- Type
bool
- search_viable
Specifies if the source has the ability to search for novels implemented.
- Type
bool
- __init__(*args, **kwargs)[source]
When initializing the source, the source is checked for cookie domains; if there are none, they are built using the
base_urls.
- abstract chapter(chapter: novelsave_sources.models.chapter.Chapter)[source]
Download and parse chapter content
The typical implementation of this method retrieves the chapter's reading content and updates the
paragraphs attribute of the provided chapter. It does not return any result.
In rare instances, other attributes of the
Chapter are also updated, such as title.
- login(email: str, password: str)[source]
Login to the source and assign the required cookies
Unlike novel and chapter, login is not marked abstract, but it does not have an implementation. By default, it throws an
UnavailableException.
You may specify whether login is implemented using
login_viable.
- Parameters
email (str) – Email or username credentials
password (str) – password credentials
- abstract novel(url: str) novelsave_sources.models.novel.Novel[source]
Download and parse novel information
The typical implementation of this method is very straightforward: it downloads and parses the profile page into a novel object. Usually the table of contents is part of this page. However, in other instances, additional downloads may be required.
- Parameters
url (str) – The url pointing towards the main profile page
- Returns
Novel object that contains the parsed data
- Return type
Novel
- search(keyword: str, *args, **kwargs) List[novelsave_sources.models.novel.Novel][source]
Search for a novel on the source
Unlike novel and chapter, search is not marked abstract, but it does not have an implementation. By default, it throws an
UnavailableException.
You may specify whether search is implemented using
search_viable.
- Parameters
keyword (str) – The query text to be used in the search. Usually part of title.
- Returns
The resulting novels from the search
- Return type
List[Novel]
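The relationship between login_viable, search_viable and UnavailableException might be used as in the sketch below. `BaseSource`, `SearchableSource`, `safe_search`, and the local exception class are hypothetical stand-ins that mirror the documented defaults; plain strings stand in for Novel objects.

```python
from typing import List


class UnavailableException(Exception):
    """Local stand-in for the package's exception of the same name."""


class BaseSource:
    """Hypothetical stand-in mirroring the documented defaults."""
    login_viable = False
    search_viable = False

    def login(self, email: str, password: str) -> None:
        raise UnavailableException("login is not implemented")

    def search(self, keyword: str, *args, **kwargs) -> List[str]:
        raise UnavailableException("search is not implemented")


class SearchableSource(BaseSource):
    """Hypothetical source that implements search but not login."""
    search_viable = True

    def search(self, keyword: str, *args, **kwargs) -> List[str]:
        # Strings are placeholders for the Novel objects a real source returns.
        return [f"result for {keyword}"]


def safe_search(source: BaseSource, keyword: str) -> List[str]:
    """Consult the flag first instead of relying on the exception."""
    return source.search(keyword) if source.search_viable else []
```

Checking the flag up front lets a caller hide unsupported features, while the exception still guards against calls made without checking.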
MetaData source interface
- class novelsave_sources.MetaSource(*args, **kwargs)[source]
Bases:
novelsave_sources.sources.crawler.Crawler
MetaData source interface
All metadata sources must implement this interface.
- abstract retrieve(url: str) List[novelsave_sources.models.metadata.Metadata][source]
Retrieves metadata from url
An implementation might retrieve the metadata by requesting it from an api endpoint or by scraping a website.
- Parameters
url (str) – Url pointing to the metadata
- Returns
List of metadata retrieved for the page.
- Return type
List[Metadata]
Gateways
Http Gateway
- class novelsave_sources.utils.gateways.BaseHttpGateway[source]
Base gateway interface that defines http communication
- abstract property cookies: requests.cookies.RequestsCookieJar
Get current cookies being used in session
The setter for this property must also be implemented.
- Returns
The cookies in the session
- Return type
RequestsCookieJar
- abstract request(method: str, url: str, headers: Optional[dict] = None, params: Optional[dict] = None, data: Optional[dict] = None, json: Optional[dict] = None) requests.models.Response[source]
Send an http request to the specified url using the specified options
- Parameters
method (str) – The method of request to send, e.g. GET, POST, PUT
url (str) – The endpoint to which the request is to be made
headers (dict) – The headers to be sent with the request. If not specified, sends default headers from the requests module.
params (dict) – The query parameters to be sent with the request.
data (dict) – ‘x-www-form-urlencoded’ data to be sent with the request.
json (dict) – json to be sent in the request body.
- Returns
The response resulting from the request
- Return type
requests.Response
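The shape of a custom gateway, including the cookies setter that must be implemented, is sketched below. To stay runnable without the package, `GatewayInterface` is a local stand-in for BaseHttpGateway, `RecordingGateway` is a hypothetical implementation, a standard-library `CookieJar` stands in for RequestsCookieJar, and a plain dict stands in for the response.

```python
from abc import ABC, abstractmethod
from http.cookiejar import CookieJar
from typing import List, Optional, Tuple


class GatewayInterface(ABC):
    """Local stand-in mirroring the documented gateway interface."""

    @property
    @abstractmethod
    def cookies(self) -> CookieJar: ...

    @cookies.setter
    @abstractmethod
    def cookies(self, cookies: CookieJar) -> None: ...

    @abstractmethod
    def request(self, method: str, url: str, headers: Optional[dict] = None,
                params: Optional[dict] = None, data: Optional[dict] = None,
                json: Optional[dict] = None): ...


class RecordingGateway(GatewayInterface):
    """Hypothetical gateway that records calls instead of using the network."""

    def __init__(self) -> None:
        self._cookies = CookieJar()
        self.calls: List[Tuple[str, str]] = []

    @property
    def cookies(self) -> CookieJar:
        return self._cookies

    @cookies.setter
    def cookies(self, cookies: CookieJar) -> None:
        self._cookies = cookies

    def request(self, method, url, headers=None, params=None, data=None, json=None):
        self.calls.append((method, url))
        # A dict is a placeholder for the response object a real gateway returns.
        return {"status_code": 200, "url": url}
```

A gateway like this could be passed as http_gateway when instantiating a crawler, for example to test parsing logic offline.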
Default http gateway
- class novelsave_sources.utils.gateways.DefaultHttpGateway[source]
Default Http gateway implementation used by sources
This implementation has the following properties:
Uses the cloudscraper package, which detects Cloudflare’s anti-bot pages.
self.session = cloudscraper.create_scraper(ssl_context=ctx)
Disables SSL verification, as this seems to break most sites.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
self.session = ...  # initialize scraper session
self.session.verify = False
As such, also disables
InsecureRequestWarning in the request context.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", InsecureRequestWarning)
    # logic
Models
Novel
- class novelsave_sources.Novel(title: str, url: str, author: typing.Optional[str] = None, synopsis: typing.List[str] = <factory>, thumbnail_url: typing.Optional[str] = None, status: typing.Optional[str] = None, lang: str = 'en', volumes: typing.List[novelsave_sources.models.volume.Volume] = <factory>, metadata: typing.List[novelsave_sources.models.metadata.Metadata] = <factory>)[source]
Data class for parsed novels
- title
The name of the novel.
- Type
str
- url
The url pointing to the webpage of the novel.
- Type
str
- author
The author of the novel.
- Type
Optional[str]
- synopsis
The description of the novel in lines or paragraphs.
- Type
List[str]
- thumbnail_url
The url pointing to the thumbnail image of the novel
- Type
Optional[str]
- status
The status of the novel, can be ongoing, completed, or hiatus
- Type
Optional[str]
- lang
The language of the novel. This is not the original language, but the language in which this novel is currently readable.
- Type
str
Volume
- class novelsave_sources.Volume(index: int, name: str, chapters: typing.List[novelsave_sources.models.chapter.Chapter] = <factory>)[source]
Data class that identifies a single volume in a novel
- index
The order of the volume in the novel. Lowest first.
- Type
int
- name
The name of the volume.
- Type
str
- add(chapter: novelsave_sources.models.chapter.Chapter)[source]
Shorthand method to add chapter into this volume
Chapter
- class novelsave_sources.Chapter(index: int = -1, title: Optional[str] = None, paragraphs: Optional[str] = None, url: Optional[str] = None, updated: Optional[datetime.datetime] = None)[source]
Data class that identifies a single chapter in a novel
- index
The order of the chapter in the novel. Lowest first.
- Type
int
- title
The title of the chapter.
- Type
str
- paragraphs
The reading content of the chapter in html.
- Type
Optional[str]
- url
The url pointing to the chapter on the web.
- Type
str
- updated
The time this chapter was last updated as defined by the source.
- Type
Optional[datetime.datetime]
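How the three models fit together can be shown with local stand-in dataclasses. The classes below copy only some of the documented fields and are not the package's own definitions; the behaviour of `add` (a plain append) and all example values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chapter:
    """Stand-in with a subset of the documented Chapter fields."""
    index: int = -1
    title: Optional[str] = None
    paragraphs: Optional[str] = None
    url: Optional[str] = None


@dataclass
class Volume:
    """Stand-in with the documented Volume fields."""
    index: int
    name: str
    chapters: List[Chapter] = field(default_factory=list)

    def add(self, chapter: Chapter) -> None:
        # Assumption: add() simply appends to the chapter list.
        self.chapters.append(chapter)


@dataclass
class Novel:
    """Stand-in with a subset of the documented Novel fields."""
    title: str
    url: str
    lang: str = "en"
    volumes: List[Volume] = field(default_factory=list)


# Build a novel with one volume and one chapter (example values only).
volume = Volume(index=0, name="Volume 1")
volume.add(Chapter(index=0, title="Prologue",
                   url="https://www.example.com/novel/1/prologue"))
novel = Novel(title="Example Novel", url="https://www.example.com/novel/1",
              volumes=[volume])
```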
Metadata
- class novelsave_sources.Metadata(name: str, value: str, others: Optional[dict] = None)[source]
Data class that holds a single value of metadata for novels
- name
Name of the metadata
Example:
subject, tag
- Type
str
- value
Value of the metadata
- Type
str
- others
A dictionary value defining other attributes of the metadata
- Type
dict
- namespace
The namespace of the metadata. This is either Dublin Core (DC) or OPF.
Dublin Core (DC) has the following tags:
title, language, subject, creator, contributor, publisher, rights, coverage, date, description
The
__init__() method automatically identifies the namespace.
- Type
str
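The namespace rule can be sketched as a simple set lookup. `detect_namespace`, the constant name, and the literal return values "DC" and "OPF" are assumptions made for illustration; only the list of Dublin Core tags comes from the text above.

```python
# Dublin Core tag names as listed in the description of namespace.
DUBLIN_CORE_TAGS = {
    "title", "language", "subject", "creator", "contributor",
    "publisher", "rights", "coverage", "date", "description",
}


def detect_namespace(name: str) -> str:
    """Return "DC" for Dublin Core tag names and "OPF" for everything else."""
    return "DC" if name in DUBLIN_CORE_TAGS else "OPF"
```

With this rule, the two example names given for name fall into different namespaces: subject is Dublin Core, while tag is OPF.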
Utilities
This package provides two sets of utility functions, one for each source type.
Note that the following functions return the source types; instantiating the source is left to you.
This gives you the opportunity to inject your own http gateway and override the default behaviour. Check out the gateways api section for more information.
Retrieve all novel sources
- novelsave_sources.novel_source_types() List[Type[novelsave_sources.sources.novel.source.Source]][source]
Return all the available novel source types
The first usage may be slow as it searches for all the source implementations and caches the results.
- Returns
All the novel source scraper implementations
- Return type
List[Type[Source]]
Find the novel source that can parse a specific url
- novelsave_sources.locate_novel_source(url: str) Type[novelsave_sources.sources.novel.source.Source][source]
Locate and return the novel source parser for the url if it is supported
- Parameters
url (str) – Url pointing to the novel or the chapter needing to be scraped.
- Returns
Specific novel scraper that supports the url provided.
- Return type
Type[Source]
- Raises
UnknownSourceException – if the url cannot be parsed by any existing source schema
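One way such a lookup could work is to ask each registered type whether it recognises the url. The sketch below is an assumption about the mechanism, not the package's code: `AlphaSource`, `BetaSource`, `SOURCE_TYPES`, `locate`, and the local exception class are all hypothetical.

```python
from typing import List, Type
from urllib.parse import urlparse


class UnknownSourceException(Exception):
    """Local stand-in for the package's exception of the same name."""


class AlphaSource:
    """Hypothetical source for alpha.example.com."""
    base_urls: List[str] = ["alpha.example.com"]

    @classmethod
    def of(cls, url: str) -> bool:
        return urlparse(url).hostname in cls.base_urls


class BetaSource(AlphaSource):
    """Hypothetical source for beta.example.com."""
    base_urls: List[str] = ["beta.example.com"]


SOURCE_TYPES: List[Type[AlphaSource]] = [AlphaSource, BetaSource]


def locate(url: str) -> Type[AlphaSource]:
    """Return the first source type whose of() accepts the url."""
    for source_type in SOURCE_TYPES:
        if source_type.of(url):
            return source_type
    raise UnknownSourceException(url)


# The lookup returns a type; creating the instance is left to the caller.
source = locate("https://beta.example.com/novel/7")()
```

Because a type is returned rather than an instance, the caller decides which constructor arguments to pass, such as a custom http gateway.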
Retrieve all metadata sources
- novelsave_sources.metadata_source_types() List[Type[novelsave_sources.sources.metadata.metasource.MetaSource]][source]
Locate and return all the metadata source types
The first usage may be slow as it searches for all the source implementations and caches the results.
- Returns
All the metadata source scraper implementations
- Return type
List[Type[MetaSource]]
Find the metadata source that can parse a specific url
- novelsave_sources.locate_metadata_source(url: str) Type[novelsave_sources.sources.metadata.metasource.MetaSource][source]
Locate and return the metadata source parser for the url if it is supported
- Parameters
url (str) – Url pointing to the metadata profile.
- Returns
Specific metadata scraper that supports the url provided.
- Return type
Type[MetaSource]
- Raises
UnknownSourceException – if the url cannot be parsed by any existing source schema
Exceptions
- exception novelsave_sources.SourcesException[source]
Base exception of this package, from which all other exceptions are derived.
- exception novelsave_sources.BadResponseException[source]
Thrown when an unexpected response is received
- exception novelsave_sources.UnknownSourceException[source]
Thrown when the url does not correspond to an existing source
- exception novelsave_sources.UnavailableException[source]
Thrown when a function is unavailable