minifold package#

Submodules#

minifold.binary_predicate module#

class BinaryPredicate(left, operator, right)[source]#

Bases: object

The BinaryPredicate represents a binary predicate to define minifold queries. See also the Query class (especially, the filters attribute).

Constructor.

Example

>>> from minifold import BinaryPredicate
>>> bp = BinaryPredicate("a", "<=", 0)
>>> entry = {"a": 1, "b": 2}
>>> bp(entry)
False
>>> entry = {"a": -1, "b": 2}
>>> bp(entry)
True
Parameters:
  • left (str) – The left operand. May be the key of an entry.

  • operator (str) – The binary operator, see also OPERATORS.

  • right (str) – The right operand. Cannot be the key of an entry.

property left: str#

Retrieves the left operand of this BinaryPredicate.

Returns:

The left operand of this BinaryPredicate.

match(entry: dict) bool[source]#

Matches an entry against this BinaryPredicate.

Parameters:

entry (dict) – An entry. One of its keys must correspond to self.left.

Raises:

KeyError – if the entry has no key matching self.left.

Returns:

True if this BinaryPredicate is satisfied by entry, False otherwise.
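The matching rule can be illustrated with a small standalone sketch. It does not import minifold: SketchPredicate and SKETCH_OPERATORS are hypothetical names, and the operator table is an assumption rather than the content of minifold's own OPERATORS.

```python
import operator

# Hypothetical operator table (an assumption, not minifold's OPERATORS).
SKETCH_OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
}


class SketchPredicate:
    """Standalone stand-in mimicking the documented behaviour."""

    def __init__(self, left, op, right):
        self.left = left
        self.operator = op
        self.right = right

    def match(self, entry: dict) -> bool:
        # entry[self.left] raises KeyError when the key is missing.
        compare = SKETCH_OPERATORS[self.operator]
        return compare(entry[self.left], self.right)

    # Calling the predicate is the same as calling match().
    __call__ = match


bp = SketchPredicate("a", "<=", 0)
print(bp({"a": 1, "b": 2}))   # False
print(bp({"a": -1, "b": 2}))  # True
```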

property operator: str#

Retrieves the binary operator of this BinaryPredicate.

Returns:

The binary operator of this BinaryPredicate.

property right#

Retrieves the right operand of this BinaryPredicate.

Returns:

The right operand of this BinaryPredicate.

minifold.cache module#

class CacheConnector(child)[source]#

Bases: Connector

CacheConnector is an abstract Connector class used to cache results in the middle of a minifold query plan. If a query reaches this connector and its result is in the cache, it is not forwarded to the underlying connector, which speeds up the execution of the query plan.

See specializations:

  • StorageCacheConnector

Possible improvements:

  • For the moment, the cache is class-name based. It should rather be identified by the connector setup and the underlying connectors (if any).

  • For the moment, the cache is only used if the exact same Query was issued in the past. But we should be able to reuse the cache if a less strict query has been issued, and use Connector.reshape_entries afterwards.

Constructor.

attributes(object: str) set[source]#

Retrieves the keys supported by this CacheEntriesConnector instance.

Returns:

The keys of the underlying entries.

callback_read(query: Query) object[source]#

Callback triggered when data must be fetched in this CacheEntriesConnector.

Parameters:

query (Query) – The handled Query instance.

Raises:

RuntimeError – if not overloaded.

Returns:

The fetched data.

callback_write(query: Query, data: object)[source]#

Callback triggered when data must be saved in this CacheEntriesConnector instance.

Parameters:
  • query (Query) – The handled Query instance.

  • data (object) – The data to be saved.

Raises:

RuntimeError – if not overloaded.

clear_cache()[source]#

Clears this CacheEntriesConnector instance entirely. This method should be overloaded.

clear_query(query: Query)[source]#

Removes a Query result from this CacheEntriesConnector instance. This method should be overloaded.

Parameters:

query (Query) – The handled Query instance.

is_cachable(query: Query, data: object) bool[source]#

Checks whether data may be saved in this CacheEntriesConnector instance.

Parameters:
  • query (Query) – The handled Query instance.

  • data (object) – The data fetched by this Query that must be saved to this CacheEntriesConnector instance.

Returns:

True if query may be cached in this CacheEntriesConnector instance, False otherwise.

is_cached(query: Query) bool[source]#

Checks whether a Query instance is already cached in this CacheEntriesConnector instance.

Parameters:

query (Query) – The handled Query instance.

Returns:

True if query is cached in this CacheEntriesConnector instance, False otherwise.

query(query: Query) list[source]#

Handles an incoming Query instance.

The CacheEntriesConnector checks whether the query is already cached. If query is cached in this CacheEntriesConnector, it is not forwarded to self.child and the results are returned directly from the cache. Otherwise, it is forwarded to self.child.

Parameters:

query (Query) – The handled Query instance.

Returns:

The corresponding entries.
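The lookup-then-forward behaviour can be sketched without minifold. This is only an illustration under stated assumptions: SketchCache is a hypothetical class, the child is a plain callable, the cache is an in-memory dict, and repr(query) is used as the cache key.

```python
class SketchCache:
    """Hypothetical in-memory cache placed in front of a child callable."""

    def __init__(self, child):
        self.child = child  # callable standing in for the child connector
        self.store = {}     # cache key -> cached results

    def query(self, query):
        key = repr(query)   # assumption: repr() identifies a query
        if key in self.store:
            # Cache hit: the child is not queried.
            return self.store[key]
        # Cache miss: forward to the child, then remember the results.
        results = self.child(query)
        self.store[key] = results
        return results


calls = []


def slow_child(query):
    calls.append(query)
    return [{"q": query}]


cache = SketchCache(slow_child)
first = cache.query("SELECT *")
second = cache.query("SELECT *")
print(len(calls))  # the child was reached only once
```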

read(query: Query) object[source]#

Fetches from this CacheEntriesConnector instance the corresponding data.

Returns:

The corresponding cached object.

write(query: Query, data: object) bool[source]#

Writes data to this CacheEntriesConnector instance.

Parameters:
  • query (Query) – The handled Query instance.

  • data (object) – The data fetched by this Query that must be saved to this CacheEntriesConnector instance.

Returns:

The corresponding cached object.

class JsonCacheConnector(child: Connector, lifetime: timedelta = datetime.timedelta(days=3), cache_dir: str = None)[source]#

Bases: StorageCacheConnector

JsonCacheConnector overloads StorageCacheConnector to cache results in JSON files.

Constructor.

Parameters:
  • child (Connector) – The child Connector instance.

  • lifetime (datetime.timedelta) – The lifetime of the cached objects. Pass None to use the default lifetime.

  • cache_dir (str) – The path to the cache directory. Pass None to use the default minifold cache directory.

class PickleCacheConnector(child: Connector, lifetime: timedelta = datetime.timedelta(days=3), cache_dir: str = None)[source]#

Bases: StorageCacheConnector

PickleCacheConnector overloads StorageCacheConnector to cache results in pickle files.

Constructor.

Parameters:
  • child (Connector) – The child Connector instance.

  • lifetime (datetime.timedelta) – The lifetime of the cached objects. Pass None to use the default lifetime.

  • cache_dir (str) – The path to the cache directory. Pass None to use the default minifold cache directory.

class StorageCacheConnector(child: Connector, callback_load: callable = None, callback_dump: callable = None, lifetime: timedelta = None, cache_dir: str = None, read_mode: str = 'r', write_mode: str = 'w', extension: str = '')[source]#

Bases: CacheConnector

StorageCacheConnector is an abstract class that specializes CacheConnector to manage files on the local storage.

See specializations:

  • JsonCacheConnector

  • PickleCacheConnector

Constructor.

Parameters:
  • child (Connector) – The child Connector instance.

  • callback_load (callable) – A function callback_load(f) where f is the read file descriptor of the cache and which returns the cached object.

  • callback_dump (callable) – A function callback_dump(data, f) where f is the write file descriptor of the cache and data is the object to be cached.

  • lifetime (datetime.timedelta) – The lifetime of the cached objects. Pass None to use the default lifetime.

  • cache_dir (str) – The path to the cache directory. Pass None to use the default minifold cache directory.

  • read_mode (str) – The string specifying how to open the file descriptor to read the cache. Possible values are "r" (text cache) and "rb" (binary cache).

  • write_mode (str) – The string specifying how to open the file descriptor to write the cache. Possible values are "w" (text cache) and "wb" (binary cache).

  • extension (str) – The extension of the cache filename.
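To make the roles of the callbacks and file modes concrete, here is a standalone sketch that only uses the standard library. roundtrip_sketch is a hypothetical helper, not part of minifold; it assumes the dump callback is called as callback_dump(data, f) and the load callback as callback_load(f), which is the calling convention of json.dump/json.load and pickle.dump/pickle.load.

```python
import json
import os
import pickle
import tempfile


def roundtrip_sketch(data, callback_load, callback_dump,
                     read_mode, write_mode, extension):
    """Dump `data` to a temporary cache file, then load it back."""
    with tempfile.TemporaryDirectory() as cache_dir:
        path = os.path.join(cache_dir, "query" + extension)
        with open(path, write_mode) as f:
            callback_dump(data, f)
        with open(path, read_mode) as f:
            return callback_load(f)


entries = [{"a": 1}, {"a": 2}]
# Text cache: JSON needs "r" / "w".
from_json = roundtrip_sketch(entries, json.load, json.dump, "r", "w", ".json")
# Binary cache: pickle needs "rb" / "wb".
from_pickle = roundtrip_sketch(entries, pickle.load, pickle.dump, "rb", "wb", ".pkl")
```

The text modes pair with JSON and the binary modes with pickle; mixing them (for instance pickle with "w") fails at write time.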

base_dir = '/home/docs/.minifold/cache'#
callback_read(query: Query) object[source]#

Callback triggered when data must be fetched in this StorageCacheConnector.

Parameters:

query (Query) – The handled Query instance.

Returns:

The fetched data.

callback_write(query: Query, data: object)[source]#

Callback triggered when data must be saved in this StorageCacheConnector instance.

Parameters:
  • query (Query) – The handled Query instance.

  • data (object) – The data to be saved.

clear_cache()[source]#

Clears this StorageCacheConnector instance entirely.

clear_query(query: Query)[source]#

Removes a Query result from this StorageCacheConnector instance.

Parameters:

query (Query) – The handled Query instance.

is_cached(query: Query) bool[source]#

Checks whether a Query instance is already cached in this StorageCacheConnector instance.

Parameters:

query (Query) – The handled Query instance.

Returns:

True if query is cached in this StorageCacheConnector instance, False otherwise.

static is_fresh_cache(cache_filename: str, lifetime: timedelta) bool[source]#

Checks whether a minifold cache file is fresh enough to be relevant.

To do so, StorageCacheConnector.is_fresh_cache fetches the date of the cache file from the filesystem and matches it against lifetime.

Parameters:
  • cache_filename (str) – The path to the cache.

  • lifetime (datetime.timedelta) – The lifetime of the cache.

Returns:

True if and only if the cache is fresh enough, False otherwise.
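A freshness check of this kind might look like the sketch below. It is an assumption that the relevant date is the file's modification time; is_fresh_sketch is a hypothetical function, and the actual implementation may use a different timestamp or comparison.

```python
import datetime
import os
import tempfile
import time


def is_fresh_sketch(cache_filename: str, lifetime: datetime.timedelta) -> bool:
    """True if the file exists and was modified less than `lifetime` ago."""
    if not os.path.exists(cache_filename):
        return False
    modified = datetime.datetime.fromtimestamp(os.path.getmtime(cache_filename))
    return datetime.datetime.now() - modified < lifetime


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "query.cache")
    with open(path, "w") as f:
        f.write("[]")
    fresh_now = is_fresh_sketch(path, datetime.timedelta(days=3))
    # Pretend the file was written four days ago.
    four_days_ago = time.time() - 4 * 24 * 3600
    os.utime(path, (four_days_ago, four_days_ago))
    fresh_later = is_fresh_sketch(path, datetime.timedelta(days=3))
```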

lifetime = datetime.timedelta(days=3)#
make_cache_filename(query: Query) str[source]#

Crafts the filename of the cache to store a given Query instance.

Parameters:

query (Query) – The handled Query instance.

Returns:

The corresponding filename.

make_cache_dir(base_dir: str, sub_dir: str = '')[source]#

Crafts the path to a minifold cache file.

Parameters:
  • base_dir (str) – The directory storing the minifold caches. See also DEFAULT_CACHE_STORAGE_BASE_DIR.

  • sub_dir (str) – The directory storing the minifold caches. This is often the name assigned to a connector pulling remote data.

minifold.cached module#

class CachedEntriesConnector(load_entries: callable, cache_filename: str, load_cache: callable, save_cache: callable, read_mode: str, write_mode: str, with_cache: bool = True)[source]#

Bases: EntriesConnector

CachedEntriesConnector is an abstract Connector class used to fetch data from a cache saved on the local storage.

See specializations:

  • JsonCachedConnector

  • PickleCachedConnector

Constructor.

Parameters:
  • load_entries (callable) – A function called to populate this CachedEntriesConnector.

  • cache_filename (str) – The path to the file used to save the cache on the local storage.

  • load_cache (callable) – A function entries = load_cache(f) where: entries is a list of dictionaries; f is the (read) file descriptor of the cache.

  • save_cache (callable) – A function save_cache(entries, f) where: entries is a list of dictionaries; f is the (write) file descriptor of the cache.

  • read_mode (str) – A string specifying how the read file descriptor of the cache must be created. Possible values are "r" (text-based cache) and "rb" (binary cache).

  • write_mode (str) – A string specifying how the write file descriptor of the cache must be created. Possible values are "w" (text-based cache) and "wb" (binary cache).
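The way these parameters might cooperate can be sketched as a load-or-populate helper. cached_entries_sketch is hypothetical and standalone; it assumes that save_cache is called as save_cache(entries, f) and that the cache is reused whenever the file exists.

```python
import json
import os
import tempfile


def cached_entries_sketch(load_entries, cache_filename, load_cache, save_cache,
                          read_mode="r", write_mode="w"):
    """Return cached entries if the cache file exists, else build the cache."""
    if os.path.exists(cache_filename):
        with open(cache_filename, read_mode) as f:
            return load_cache(f)
    entries = load_entries()
    with open(cache_filename, write_mode) as f:
        save_cache(entries, f)
    return entries


loads = []


def load_entries():
    # Stands in for an expensive fetch; records each invocation.
    loads.append(1)
    return [{"a": 1}, {"a": 2}]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "entries.json")
    first = cached_entries_sketch(load_entries, path, json.load, json.dump)
    second = cached_entries_sketch(load_entries, path, json.load, json.dump)
```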

class JsonCachedConnector(load_entries: callable, cache_filename: str, **kwargs)[source]#

Bases: CachedEntriesConnector

The JsonCachedConnector class implements the CachedEntriesConnector using a JSON cache file.

Constructor. See also CachedEntriesConnector.__init__

Parameters:
  • load_entries (callable) – A function called to populate this CachedEntriesConnector.

  • cache_filename (str) – The path to the file used to save the cache on the local storage.

class PickleCachedConnector(load_entries: callable, cache_filename: str, **kwargs)[source]#

Bases: CachedEntriesConnector

The PickleCachedConnector class implements the CachedEntriesConnector using a pickle cache file.

Constructor. See also CachedEntriesConnector.__init__

Parameters:
  • load_entries (callable) – A function called to populate this CachedEntriesConnector.

  • cache_filename (str) – The path to the file used to save the cache on the local storage.

minifold.closure module#

closure(key: object, fds: dict) set[source]#

Compute the closure of a key given a set of functional dependencies.

Parameters:
  • key (object) – A key. Possible types are str or set, frozenset or list of strings.

  • fds (dict) – A dictionary where each key-value pair represents a functional dependency (by mapping a key with another key).

Returns:

The “reachable” attributes from key.
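As an illustration of what such a closure computes, here is a standalone fixed-point sketch. closure_sketch is a hypothetical function; the layout of fds (each key is a str or a frozenset of attribute names, each value a str or a collection of attribute names) is an assumption and may differ from what minifold expects.

```python
def _as_set(key) -> set:
    """Normalize a key (str or collection of str) into a set of attributes."""
    return {key} if isinstance(key, str) else set(key)


def closure_sketch(key, fds: dict) -> set:
    """Attributes reachable from `key` by repeatedly applying `fds`."""
    reached = _as_set(key)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds.items():
            # A dependency applies once all its left-hand attributes are reached.
            if _as_set(lhs) <= reached and not _as_set(rhs) <= reached:
                reached |= _as_set(rhs)
                changed = True
    return reached


fds = {"a": "b", "b": "c", frozenset({"a", "d"}): "e"}
from_a = closure_sketch("a", fds)
from_ad = closure_sketch(["a", "d"], fds)
```

Here "a" alone reaches "b" and "c", while the multiple key ("a", "d") additionally reaches "e".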

is_multiple_key(key: object) bool[source]#

Checks whether a key is a multiple key (i.e., a key involving multiple attributes).

Parameters:

key (object) – A key. Possible types are str or set, frozenset or list of strings.

Returns:

True if and only if key is multiple, False otherwise.

minimal_cover(fds: set) set[source]#

Restricts a set of functional dependencies to make it 3-NF.

Parameters:

fds (dict) – A dictionary where each key-value pair represents a functional dependency (by mapping a key with another key).

Returns:

The corresponding subset of functional dependencies.

minifold.config module#

class Config(*args, **kwargs)[source]#

Bases: dict

Config is used to centralize the minifold configuration, possibly fetched from multiple files.

Note this class inherits Singleton, meaning that any instance of this class corresponds to the same instance.

Each piece of configuration must be a dictionary whose keys are arbitrary (minifold gateway) names. Each gateway is mapped to a dictionary which maps "key" to the appropriate Connector type and "args" to the parameters to be passed to the corresponding constructor.

These configuration pieces are typically stored in configuration files that reside in ~/.minifold/conf. By convention, each JSON file stored in this directory is named “gw_type:gw_name.json”, where gw_type indicates the underlying minifold connector and gw_name identifies the nature of the data source.

See DEFAULT_MINIFOLD_CONFIG.

Example

>>> from minifold.dblp import DblpConnector
>>> config = Config()
>>> config.loads(DEFAULT_MINIFOLD_CONFIG)
>>> dblp1 = config.make_connector("dblp:dagstuhl")
>>> dblp2 = config.make_connector("dblp:uni-trier")
load(stream)[source]#

Populates the Config instance from an input stream (e.g., a read file descriptor) storing JSON data.

Parameters:

stream – The input JSON stream.

load_file(filename: str)[source]#

Populates the Config instance from a JSON file.

Parameters:

filename (str) – The path to the input JSON file. Example: ~/.minifold/conf/gw_type:gw_name.json.

loads(s_json: str)[source]#

Populates the Config instance from a JSON string.

Parameters:

s_json (str) – A string containing JSON configuration.

make_connector(name: str) Connector[source]#

Makes a Connector thanks to the Config instance.

Parameters:

name (str) – The name identifying the Connector in the minifold configuration

Raises:

KeyError – if name is not in the Config instance.

Returns:

The corresponding Connector instance.
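The configuration layout described for Config ("key" naming the Connector type, "args" holding the constructor parameters) can be sketched as follows. Everything here is hypothetical and standalone: SketchEntriesConnector, SKETCH_CLASSES, SKETCH_CONFIG and make_connector_sketch are illustration names, and the gateway name "entries:demo" is made up.

```python
import json


class SketchEntriesConnector:
    """Hypothetical connector type used only for this illustration."""

    def __init__(self, entries):
        self.entries = entries


# Hypothetical registry mapping type names to classes.
SKETCH_CLASSES = {"SketchEntriesConnector": SketchEntriesConnector}

SKETCH_CONFIG = """
{
    "entries:demo": {
        "key": "SketchEntriesConnector",
        "args": {"entries": [{"a": 1}]}
    }
}
"""


def make_connector_sketch(config: dict, name: str):
    conf = config[name]  # raises KeyError if name is unknown
    cls = SKETCH_CLASSES[conf["key"]]
    return cls(**conf.get("args", {}))


config = json.loads(SKETCH_CONFIG)
connector = make_connector_sketch(config, "entries:demo")
```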

minifold.connector module#

class Connector[source]#

Bases: object

The Connector class is the base class of most of the classes involved in minifold.

A minifold query plan is a hierarchy of Connector instances. This hierarchy forms a pipeline, where leaves correspond to gateways to some data sources and internal nodes to SQL-like operators or intermediate caches.

Running a query consists in sending a Query instance to the root Connector instance of the query plan. Then, each Connector possibly alters the query and forwards it to its children. The leaves of the query plan are Connector instances acting as gateways that fetch data from a remote or local data source. A read query returns the entries matched by the input minifold query (if any) to the parent node. Step by step, the entries reach the root node of the query plan.

As a result, once the hierarchy of the query plan is ready, the only relevant entry point is the root Connector instance.
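The pipeline idea can be sketched with three tiny standalone classes. They are hypothetical stand-ins (LeafSketch, WhereSketch, LimitSketch), not minifold connectors, and the query argument is simply passed through untouched.

```python
class LeafSketch:
    """Leaf of the plan: a gateway to some data (here, an in-memory list)."""

    def __init__(self, entries):
        self.entries = entries

    def query(self, query):
        return list(self.entries)


class WhereSketch:
    """Internal node: keeps the entries accepted by `keep`."""

    def __init__(self, child, keep):
        self.child = child
        self.keep = keep

    def query(self, query):
        return [e for e in self.child.query(query) if self.keep(e)]


class LimitSketch:
    """Internal node: keeps at most `limit` entries."""

    def __init__(self, child, limit):
        self.child = child
        self.limit = limit

    def query(self, query):
        return self.child.query(query)[: self.limit]


leaf = LeafSketch([{"a": 1}, {"a": 5}, {"a": 7}, {"a": 9}])
plan = LimitSketch(WhereSketch(leaf, lambda e: e["a"] > 2), 2)
# Only the root is queried; the entries flow back up from the leaf.
result = plan.query(None)
```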

Constructor.

answer(query: Query, ret: list)[source]#

Method traversed when this Connector is ready to answer a given Query.

This method is used in the child classes to trace entries in the (complex) query plans.

Parameters:
  • query (Query) – The related Query instance.

  • ret (list) – The corresponding results.

attributes(object: str) set[source]#

Lists the available attributes related to a given collection of minifold entries exposed by this Connector instance.

Parameters:

object (str) – The name of the collection.

Returns:

The set of corresponding attributes.

static get_class(name: str)[source]#
query(query: Query) list[source]#

Handles an input Query instance.

Parameters:

query (Query) – The handled query.

Returns:

The list of entries matching the input query.

reshape_entries(query: Query, entries: list) list[source]#

Reshapes entries returned by self.query() before calling self.answer().

This method should only be called if the Connector supports only a subset of the query operators among {SELECT, WHERE, LIMIT, OFFSET} in SQL.

Parameters:
  • query (Query) – The handled Query instance.

  • entries (list) – The list of raw entries fetched so far, corresponding to SELECT * FROM foo LIMIT n WHERE n >= query.limit.

Returns:

The reshaped entries.
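What such a reshaping step has to do can be sketched in plain Python. reshape_sketch is a hypothetical function with made-up parameters (keep, offset, limit, attributes); it applies the operators in the usual SQL order and does not reproduce minifold's Query object.

```python
def reshape_sketch(entries, attributes=None, keep=None, offset=0, limit=None):
    """Apply WHERE, OFFSET, LIMIT, then SELECT to raw entries."""
    rows = [e for e in entries if keep is None or keep(e)]  # WHERE
    rows = rows[offset:]                                    # OFFSET
    if limit is not None:
        rows = rows[:limit]                                 # LIMIT
    if attributes is not None:
        # SELECT: keep only the requested attributes (None if missing).
        rows = [{k: e.get(k) for k in attributes} for e in rows]
    return rows


raw = [{"a": i, "b": -i} for i in range(1, 6)]
reshaped = reshape_sketch(
    raw, attributes=["a"], keep=lambda e: e["a"] >= 2, offset=1, limit=2
)
```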

subclasses = {'minifold.cache.CacheConnector': <class 'minifold.cache.CacheConnector'>, 'minifold.cache.JsonCacheConnector': <class 'minifold.cache.JsonCacheConnector'>, 'minifold.cache.PickleCacheConnector': <class 'minifold.cache.PickleCacheConnector'>, 'minifold.cache.StorageCacheConnector': <class 'minifold.cache.StorageCacheConnector'>, 'minifold.cached.CachedEntriesConnector': <class 'minifold.cached.CachedEntriesConnector'>, 'minifold.cached.JsonCachedConnector': <class 'minifold.cached.JsonCachedConnector'>, 'minifold.cached.PickleCachedConnector': <class 'minifold.cached.PickleCachedConnector'>, 'minifold.count.CountConnector': <class 'minifold.count.CountConnector'>, 'minifold.csv.CsvConnector': <class 'minifold.csv.CsvConnector'>, 'minifold.dblp.DblpConnector': <class 'minifold.dblp.DblpConnector'>, 'minifold.download.DownloadConnector': <class 'minifold.download.DownloadConnector'>, 'minifold.entries_connector.EntriesConnector': <class 'minifold.entries_connector.EntriesConnector'>, 'minifold.google_scholar.GoogleScholarConnector': <class 'minifold.google_scholar.GoogleScholarConnector'>, 'minifold.group_by.GroupByConnector': <class 'minifold.group_by.GroupByConnector'>, 'minifold.hal.HalConnector': <class 'minifold.hal.HalConnector'>, 'minifold.html_table.HtmlTableConnector': <class 'minifold.html_table.HtmlTableConnector'>, 'minifold.join_if.JoinIfConnector': <class 'minifold.join_if.JoinIfConnector'>, 'minifold.json.JsonConnector': <class 'minifold.json.JsonConnector'>, 'minifold.json.JsonFileConnector': <class 'minifold.json.JsonFileConnector'>, 'minifold.lambdas.LambdasConnector': <class 'minifold.lambdas.LambdasConnector'>, 'minifold.ldap.LdapConnector': <class 'minifold.ldap.LdapConnector'>, 'minifold.limit.LimitConnector': <class 'minifold.limit.LimitConnector'>, 'minifold.mongo.MongoConnector': <class 'minifold.mongo.MongoConnector'>, 'minifold.natural_join.NaturalJoinConnector': <class 'minifold.natural_join.NaturalJoinConnector'>, 
'minifold.rename.RenameConnector': <class 'minifold.rename.RenameConnector'>, 'minifold.select.SelectConnector': <class 'minifold.select.SelectConnector'>, 'minifold.sort_by.SortByConnector': <class 'minifold.sort_by.SortByConnector'>, 'minifold.twitter.TwitterConnector': <class 'minifold.twitter.TwitterConnector'>, 'minifold.union.UnionConnector': <class 'minifold.union.UnionConnector'>, 'minifold.unique.UniqueConnector': <class 'minifold.unique.UniqueConnector'>, 'minifold.unnest.UnnestConnector': <class 'minifold.unnest.UnnestConnector'>, 'minifold.where.WhereConnector': <class 'minifold.where.WhereConnector'>}#
trace_entries = False#
trace_only_keys = False#
trace_queries = False#

minifold.connector_util module#

This file gathers some useful functions to get and display some entries from an arbitrary Connector instance.

get_values(connector: Connector, attribute: str) list[source]#

Retrieves distinct values mapped with a given attribute stored in the entries of a given Connector instance.

Parameters:
  • connector (Connector) – The queried Connector instance.

  • attribute (str) – The queried attribute.

Raises:

KeyError – if attribute is not a valid attribute.

Returns:

The corresponding values.

show_some_values(connector: Connector, limit: int = 5, max_strlen: int = None, query: Query = None)[source]#

Displays in a Jupyter notebook some values for each attribute of the entries served by a given Connector instance. This is useful to discover a dataset.

Parameters:
  • connector (Connector) – The queried Connector instance.

  • limit (int) – A positive integer or None (no limit) that upper bounds the number of values to be fetched for each attribute.

  • query (Query) – The Query instance used to probe connector. You may pass None to issue the default query.

minifold.count module#

class CountConnector(child: Connector)[source]#

Bases: Connector

The CountConnector class is used to implement the COUNT statement in a minifold query plan. As it is one of the few connectors that return an integer (instead of a list of entries), it is often the root connector in the tree modeling the minifold query plan.

Constructor.

Parameters:

child (Connector) – The child Connector instance.

query(query: Query) int[source]#

Handles a minifold query.

Parameters:

query (Query) – The handled Query instance.

Raises:

ValueError – if query.action != ACTION_READ.

Returns:

The number of entries matched by this Query.

count(entries: list) int[source]#

Count the number of entries.

Example

>>> count([{"a": 10}, {"a": 20}])
2
Parameters:

entries (list) – A list of entries.

Returns:

The number of elements in entries.

count_gen(gen: iter) int[source]#

Count the number of elements listed by a generator.

Parameters:

gen (iter) – A generator.

Returns:

The number of elements listed by a generator.
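Counting a generator cannot rely on len(), since generators have no length. One common way to do it is sketched below; count_gen_sketch is a hypothetical name, and the actual implementation may differ.

```python
def count_gen_sketch(gen) -> int:
    """Count the items of an iterable by consuming it."""
    return sum(1 for _ in gen)


evens = (i for i in range(10) if i % 2 == 0)
n_evens = count_gen_sketch(evens)
# The generator is exhausted afterwards, so counting again gives 0.
n_again = count_gen_sketch(evens)
```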

minifold.country module#

This file provides utilities to wrap the pycountry python module.

country_code_to_name(country_code: str) str[source]#

Retrieves the name of a country given its code.

Parameters:

country_code (str) – The input country code.

Returns:

The corresponding country name.

minifold.csv module#

class CsvConnector(data: str, delimiter: chr = ' ', quotechar: chr = '"', mode: CsvModeEnum = CsvModeEnum.FILENAME)[source]#

Bases: Connector

The CsvConnector is a minifold gateway that allows manipulating data stored in a CSV file.

Constructor.

Parameters:
  • data (str) – Depending on the nature of the CSV data source, this string either contains the path to the CSV file; or the CSV data itself; or the TextIOBase instance. See the mode parameter.

  • delimiter (str) – The string that delimits each column of the input CSV. Example: ";", "|", " ". Defaults to ' '.

  • quotechar (str) – The character used to delimit values that may contain delimiter. Defaults to '"'.

  • mode (CsvModeEnum) – The nature of the input CSV source. See also the CsvModeEnum enumeration.
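The interplay between delimiter and quotechar can be shown with the standard csv module alone. This sketch does not use CsvConnector; read_csv_entries_sketch is a hypothetical helper that parses CSV text into a list of dictionaries.

```python
import csv
import io


def read_csv_entries_sketch(text: str, delimiter: str = ";",
                            quotechar: str = '"') -> list:
    """Parse CSV text into a list of dict entries (one per data row)."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter=delimiter, quotechar=quotechar
    )
    return [dict(row) for row in reader]


# The first value contains the delimiter, hence the surrounding quotes.
text = 'name;city\n"Doe; Jane";Paris\n'
csv_entries = read_csv_entries_sketch(text)
```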

attributes(object: str)[source]#

Lists the attributes of the collection of objects stored in this CsvConnector instance.

Parameters:

object (str) – The name of the minifold object. As a CsvConnector instance stores a single collection, object is not relevant and you may pass None.

Returns:

The set of available object’s attributes

query(query: Query) list[source]#

Handles an input Query instance.

Parameters:

query (Query) – The handled query.

Returns:

The list of entries matching query.

class CsvModeEnum(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: IntEnum

Enumeration listing the different kinds of supported CSV data sources.

FILENAME = 0#
STRING = 1#
TEXTIO = 2#

minifold.dblp module#

The DBLP Computer science bibliography provides open bibliographic information on major computer science journals and proceedings.

Some query examples:

  • Retrieves articles authored by Fabien Mathieu, in JSON format <https://dblp.dagstuhl.de/search/publ/api?q=fabien-mathieu&h=500&format=json>

  • Retrieves articles authored by Fabien Mathieu published in 2014, in JSON format <https://dblp.dagstuhl.de/search/publ/api?q=fabien-mathieu%20year:2014&h=500&format=json>
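The query URLs listed above can be assembled with the standard library. dblp_search_url_sketch is a hypothetical helper, not part of minifold; it only builds the URL string and performs no request.

```python
from urllib.parse import quote, urlencode


def dblp_search_url_sketch(keywords: str, max_hits: int = 500,
                           fmt: str = "json",
                           base_url: str = "https://dblp.dagstuhl.de") -> str:
    """Build a DBLP publication search URL like the examples above."""
    params = {"q": keywords, "h": max_hits, "format": fmt}
    # quote() encodes spaces as %20; safe=":" keeps "year:2014" readable.
    query_string = urlencode(params, safe=":", quote_via=quote)
    return f"{base_url}/search/publ/api?{query_string}"


url_all = dblp_search_url_sketch("fabien-mathieu")
url_2014 = dblp_search_url_sketch("fabien-mathieu year:2014")
```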

By default DBLP only returns up to 30 records (see this link).

The default limit in DblpConnector is set to 9999.

It is possible to query a specific researcher using their DBLP ID. The ID can be found by browsing the page related to a researcher.

Example:

  • Query using the DBLP name <https://dblp.uni-trier.de/pers/hd/c/Chen:Chung_Shue>

  • Query using the DBLP ID <https://dblp.org/pid/30/1446>

  • The DBLP ID can be obtained by clicking on the export bibliography icon.

For the moment, the only result format supported by DBLP is XML (see this link).

class DblpConnector(map_dblp_id: dict = None, map_dblp_name: dict = None, dblp_api_url: str = 'https://dblp.dagstuhl.de')[source]#

Bases: Connector

The DblpConnector class is a minifold gateway allowing to fetch data from DBLP (repository of scientific articles).

See also:

  • HalConnector

  • GoogleScholarConnector

Constructor.

Parameters:
  • map_dblp_id (dict) – Maps an author full name with his/her DBLP ID.

  • map_dblp_name (dict) – Maps an author obsolete full name with his/her current full name.

  • dblp_api_url (str) – The URL of the DBLP server. Defaults to DBLP_API_URL.

property api_url: str#

Retrieves the URL of the remote DBLP server managed by this DblpConnector instance.

Returns:

The URL of the remote DBLP server.

attributes(object: str) set[source]#

Lists the attributes of the collection of objects stored in this DblpConnector instance.

Parameters:

object (str) – The name of the minifold object. As a DblpConnector instance stores a single collection, object is not relevant and you may pass None.

Returns:

The set of available object’s attributes

binary_predicate_to_dblp(p: BinaryPredicate, result: dict)[source]#

Converts a minifold predicate to a DBLP predicate.

Parameters:
  • p (BinaryPredicate) – A BinaryPredicate instance.

  • result (dict) – The output dictionary.

extract_entries(query: Query, results: list) list[source]#

Extracts the minifold entries from a DBLP query result.

Parameters:
  • query (Query) – A Query instance.

  • results – The DBLP results.

Returns:

A list of dict instances (the minifold entries).

property format: str#

Retrieves the format of data retrieved from the DBLP server.

Returns:

A value in {"xml", "json", "jsonp"}.

get_dblp_id(s: str) str[source]#

Retrieves the current DBLP ID of an author.

Parameters:

s (str) – The input fullname.

Returns:

The corresponding DBLP ID or s.

get_dblp_name(s: str) str[source]#

Retrieves the current DBLP fullname of an author.

Parameters:

s (str) – The input fullname.

Returns:

The (possibly updated) fullname.

property map_dblp_id: dict#

Retrieves the dictionary that maps an author full name with his/her DBLP ID.

Returns:

The queried dictionary.

property map_dblp_name: dict#

Retrieves the dictionary that maps an author obsolete full name with his/her current full name.

Returns:

The queried dictionary.

query(query: Query) list[source]#

Handles an input Query instance.

Parameters:

query (Query) – The handled query.

Returns:

The list of entries matching query.

reshape_entries(query: Query, entries: list) list[source]#

Apply DblpConnector.reshape_entry() to a list of entries.

Parameters:
  • query (Query) – A Query instance.

  • entries (list) – A list of minifold entries.

Returns:

The reshaped entries.

reshape_entry(query: Query, entry: dict) dict[source]#

Reshape a DBLP entry to make it compliant with a given Query instance.

Parameters:
  • query (Query) – A Query instance.

  • entry (dict) – The entry to reshape.

Returns:

The reshaped entry.

static to_dblp_name(s: str, map_dblp_name: dict = None) str[source]#

Converts a fullname to the corresponding current DBLP name.

Parameters:

s (str) – The input fullname.

Returns:

The (possibly updated) fullname.

static to_doc_type(s: str) DocType[source]#

Converts a DBLP doc type to the corresponding DocType instance.

Parameters:

s (str) – The DBLP doc type.

Returns:

The corresponding DocType instance.

to_canonic_fullname(s: str) str[source]#

Canonizes an author full name.

Parameters:

s (str) – The input fullname.

Returns:

The canonized fullname.

minifold.dict_util module#

freeze_dict(d: dict) dict[source]#

Freezes the values of a dictionary (for each value of type set).

Parameters:

d (dict) – The input dictionary.

Returns:

The frozen dictionary.
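A possible reading of this behaviour is sketched below: every value of type set becomes a frozenset, and other values are left untouched. freeze_dict_sketch is a hypothetical function written for illustration; whether the original returns a new dictionary or modifies its input is not stated, and this sketch returns a new one.

```python
def freeze_dict_sketch(d: dict) -> dict:
    """Return a copy of `d` where each set value is replaced by a frozenset."""
    return {
        key: frozenset(value) if isinstance(value, set) else value
        for key, value in d.items()
    }


frozen = freeze_dict_sketch({"a": {1, 2}, "b": 3})
# Frozen values are hashable, so they can be used as dict keys or set items.
hashable = {frozen["a"]}
```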

reverse_dict(d: dict) dict[source]#

Reverses a dictionary (swaps its keys and its values). Note that (key, value) pairs may disappear if the values are not unique.

Parameters:

d (dict) – The input dictionary.

Returns:

The reversed dictionary.
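The loss of pairs mentioned above is easy to see with a dictionary comprehension. reverse_dict_sketch is a hypothetical standalone function; which of the colliding keys survives in the actual implementation is an assumption (here, the last one in insertion order).

```python
def reverse_dict_sketch(d: dict) -> dict:
    """Swap keys and values; later pairs overwrite earlier ones on collision."""
    return {value: key for key, value in d.items()}


unique = reverse_dict_sketch({"a": 1, "b": 2})
# Both "a" and "b" map to 1, so only one of them survives the reversal.
collided = reverse_dict_sketch({"a": 1, "b": 1})
```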

minifold.doc_type module#

class DocType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: IntEnum

The DocType class enumerates the different kinds of scientific publications.

See also:

  • HalConnector

  • DblpConnector

ARTICLE = 1#
BOOKS_AND_THESES = 0#
CHAPTER = 7#
COMMUNICATION = 2#
HDR = 6#
JOURNAL = 3#
PATENT = 8#
POSTER = 5#
REPORT = 4#
UNKNOWN = 9999#
doc_type_to_html(t: DocType, n: int = 1) str[source]#

Builds the HTML label related to a collection of homogeneous documents.

Parameters:
  • t (DocType) – The type of the documents.

  • n (int) – The number of documents.

Returns:

The corresponding HTML label.

minifold.download module#

class DownloadConnector(map_url_out: dict, child: ~minifold.connector.Connector, downloads: callable = <function downloads>, extract_response: callable = <function extract_response>)[source]#

Bases: Connector

The DownloadConnector allows fetching data over HTTP.

Constructor.

Parameters:
  • map_url_out (dict) – A dict(str: str) mapping an entry attribute containing a URL to another entry attribute that will store the contents provided by this URL. Example: {"url": "html_content"}

  • child (Connector) – The child Connector.

  • downloads (callable) – Callback(urls) -> dict(url: content) where urls is an iterable of URLs and where the returned dict maps each URL to the corresponding response. Note: You could pass partial(download, ...) to customize the timeouts.

  • extract_response (callable) – Callback(response) -> str callback used to extract data from an HTTP query.

attributes(object: str) set[source]#

Lists available attributes related to a given collection of object stored in this DownloadConnector instance.

Parameters:

object (str) – The name of the collection of entries.

Returns:

The set of available attributes for object.

query(query: Query) list[source]#

Handles an input Query instance.

Parameters:

query – A Query instance.

Returns:

The corresponding results.

download(url: str, timeout: tuple = (1.0, 2.0), cache_filename: str = None)[source]#

Downloads the content related to a given URL. See also request_cache() to enable caching.

Parameters:
  • url (str) – A string containing the target URL.

  • timeout (tuple) – A (float, float) tuple corresponding to the (connect timeout, read timeout).

  • cache_filename (str) – A string containing path to the cache to use. Pass None to use the default cache.

Raises:
  • requests.exceptions.ConnectionError

  • requests.exceptions.ConnectTimeout

  • requests.exceptions.ContentDecodingError

  • requests.exceptions.ReadTimeout

  • requests.exceptions.SSLError

Returns:

The corresponding response.

downloads(*args) dict[source]#

Performs multiple downloads in parallel. See downloads_async for further details. See also request_cache() to enable caching.

Parameters:
  • urls (list) – An iterable over strings, each of them corresponding to an URL.

  • timeout (tuple) – A tuple (float, float) corresponding to the (connect timeout, read timeout).

  • return_exceptions (bool) – Pass True if this function is allowed to raise exceptions or must be quiet, False otherwise. Defaults to True.

Returns:

A dict({str : ?}) mapping each queried URL to the corresponding contents (if successful), or to the corresponding Exception otherwise.

async downloads_async(urls: list, timeout: tuple = (1.0, 2.0), return_exceptions: bool = True)[source]#

Asynchronous download procedure.

Parameters:
  • urls (list) – An iterable over strings, each of them corresponding to an URL.

  • timeout (tuple) – A tuple (float, float) corresponding to the (connect timeout, read timeout).

  • return_exceptions (bool) – Pass True if this function is allowed to raise exceptions or must be quiet, False otherwise. Defaults to True.

Raises:

See the download() function for the list of possible exceptions.

Returns:

A dict({str : ?}) mapping each queried URL to the corresponding contents (if successful), or to the corresponding Exception otherwise.
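The returned mapping resembles what asyncio.gather() yields when its own return_exceptions flag is set. The sketch below illustrates this under that assumption and is not minifold's implementation: fake_fetch and gather_by_url are hypothetical names, and no network access is performed.

```python
import asyncio


async def fake_fetch(url: str) -> str:
    # Hypothetical stand-in for a real HTTP download.
    if "bad" in url:
        raise ConnectionError(f"cannot reach {url}")
    return f"contents of {url}"


async def gather_by_url(urls: list, return_exceptions: bool = True) -> dict:
    # asyncio.gather preserves the order of its awaitables, so the
    # results can be zipped back with the URLs that produced them.
    results = await asyncio.gather(
        *(fake_fetch(url) for url in urls),
        return_exceptions=return_exceptions,
    )
    return dict(zip(urls, results))


responses = asyncio.run(
    gather_by_url(["http://ok.example", "http://bad.example"])
)
```

In this sketch, responses maps the first URL to a string and the second one to the ConnectionError instance that was caught.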

extract_response(response: object, extract_text: bool = True) object[source]#

Extracts from a response the corresponding contents or Exception.

Parameters:
  • response (object) – A requests.Response or an Exception instance.

  • extract_text (bool) – A bool indicating if text must be extracted from HTML.

Returns:

The corresponding Exception or str.

now() str[source]#

Crafts a string storing the current timestamp.

Returns:

A string containing current timestamp.

trim_http(url: str) str[source]#

Discards the "http://" or "https://" prefix from a URL.

Parameters:

url (str) – A str containing a URL.

Returns:

The str corresponding to the trimmed URL.
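As an illustration only, the described behaviour can be approximated as follows. The name trim_http_sketch is hypothetical, and details such as case sensitivity are assumptions rather than documented facts.

```python
def trim_http_sketch(url: str) -> str:
    # Strip a leading scheme prefix if present; leave other URLs untouched.
    for prefix in ("https://", "http://"):
        if url.startswith(prefix):
            return url[len(prefix):]
    return url


trimmed = trim_http_sketch("https://example.org/index.html")
```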

minifold.entries_connector module#

class EntriesConnector(entries: list)[source]#

Bases: Connector

EntriesConnector wraps a list of minifold entries (list of dictionaries).

Constructor.

Parameters:

entries (list) – A list of minifold entries.

attributes(obj: str = None) set[source]#

Lists available attributes related to a given collection of object stored in this EntriesConnector instance.

Parameters:

obj (str) – The name of the collection of entries. As EntriesConnector manages a single collection, you may pass None.

Returns:

The set of available attributes.

property entries: list#

Accessor to the entries nested in this EntriesConnector instance.

Returns:

The nested entries.

query(query: Query) list[source]#

Handles an input Query instance.

Parameters:

query (Query) – The handled query.

Returns:

The list of entries matching the input Query.

minifold.filesystem module#

This file gathers useful functions to interact with the filesystem of the local storage.

check_writable_directory(directory: str)[source]#

Tests whether a directory is writable. If not, an exception is raised.

Parameters:

directory (str) – A String containing an absolute path.

Raises:

RuntimeError – If the directory does not exist or isn’t writable.

ctime(path: str) datetime[source]#

Retrieves the creation date of a file.

Parameters:

path (str) – A string containing the path of the file.

Returns:

The corresponding datetime.

find(dir_name: str) list[source]#

Lists the regular files stored in a given directory or in one of its subdirectories (in shell: find -type f dir_name).

Parameters:

dir_name (str) – A String corresponding to an existing directory.

Returns:

A list of strings, each of them corresponding to a file.
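A recursive listing of this kind can be sketched with os.walk from the standard library. The function find_sketch is a hypothetical name; the ordering of the result (sorted here) is an assumption.

```python
import os
import tempfile


def find_sketch(dir_name: str) -> list:
    # Walk the tree and keep regular files only.
    paths = []
    for root, _dirs, files in os.walk(dir_name):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path):
                paths.append(path)
    return sorted(paths)


# Build a small throwaway tree to exercise the sketch.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(tmp, rel), "w") as f:
            f.write("x")
    found = [os.path.relpath(p, tmp) for p in find_sketch(tmp)]
```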

mkdir(directory: str)[source]#

Creates a directory (in shell: mkdir -p).

Parameters:

directory (str) – A String containing an absolute path.

Raises:

OSError – If the directory cannot be created.

mtime(path: str) datetime[source]#

Retrieves the modification date of a file.

Parameters:

path (str) – A string containing the path of the file.

Returns:

The corresponding datetime.

rm(path: str, recursive: bool = False)[source]#

Removes a file (in shell: rm -f).

Parameters:
  • path (str) – A string containing the path of the file to remove.

  • recursive (bool) – Pass True to remove recursively a directory (in shell: rm -rf).
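The shell equivalents quoted above (mkdir -p, rm -f, rm -rf) map onto standard-library calls. The sketch below shows one possible mapping; mkdir_sketch and rm_sketch are hypothetical names and do not claim to mirror minifold's error handling.

```python
import os
import shutil
import tempfile


def mkdir_sketch(directory: str):
    # Like "mkdir -p": create parents, do not fail if it already exists.
    os.makedirs(directory, exist_ok=True)


def rm_sketch(path: str, recursive: bool = False):
    if recursive and os.path.isdir(path):
        shutil.rmtree(path)      # like "rm -rf"
    elif os.path.exists(path):
        os.remove(path)          # like "rm -f": a missing path is ignored


with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "a", "b")
    mkdir_sketch(nested)
    mkdir_sketch(nested)                      # second call is a no-op
    created = os.path.isdir(nested)
    rm_sketch(os.path.join(tmp, "a"), recursive=True)
    removed = not os.path.exists(os.path.join(tmp, "a"))
    rm_sketch(os.path.join(tmp, "missing"))   # silently ignored
```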

minifold.for_each module#

class ForEachFilter(attribute_name: str, map_filter: dict)[source]#

Bases: object

The ForEachFilter class is useful when a minifold entry (i.e., a dictionary) maps a key to a value which is a list of sub-entries.

It filters out each sub-entry that does not match the set of provided filters.

Constructor.

Parameters:
  • attribute_name (str) – The key of the minifold entry attribute mapped with a list of sub-entries.

  • map_filter (dict) – A dict(sub_key : filter) where sub_key corresponds to an (optional) sub-entry key and filter is a callable(object) -> bool returning True if and only if the value mapped to sub_key indicates that the sub-entry must be filtered, False otherwise.

match_filters(sub_entry: dict) bool[source]#

Checks whether a sub-entry is filtered according to this ForEachFilter instance.

Parameters:

sub_entry (dict) – The sub-entry to be checked.

Returns:

True if sub_entry is not filtered, False otherwise.

for_each_sub_entry(entry, attribute: str, map_lambda: dict) dict[source]#

Applies a set of lambda functions to the sub-entries carried by a minifold entry (key, value) pair.

Parameters:
  • attribute (str) – The key of entry carrying the sub-entries to be processed.

  • map_lambda (dict) – A dict(sub_key : transform) where: sub_key is an (optional) key involved in a sub-entry; transform is a function transforming the value mapped to sub_key to its new value.

Returns:

The modified entry.
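To make the roles of attribute and map_lambda concrete, here is a minimal sketch. The name for_each_sub_entry_sketch is hypothetical, and the in-place update of the sub-entries is an assumption suggested by the phrase "the modified entry".

```python
def for_each_sub_entry_sketch(entry: dict, attribute: str, map_lambda: dict) -> dict:
    for sub_entry in entry.get(attribute, []):
        for sub_key, transform in map_lambda.items():
            # sub_key is optional: skip sub-entries that do not carry it.
            if sub_key in sub_entry:
                sub_entry[sub_key] = transform(sub_entry[sub_key])
    return entry


entry = {"title": "t", "authors": [{"name": "ada lovelace"}, {"id": 2}]}
result = for_each_sub_entry_sketch(entry, "authors", {"name": str.title})
```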

minifold.google_scholar module#

This file integrates the classes provided in scholar.py into minifold. It allows querying Google Scholar.

class GoogleScholarConnector(citation_format: str = 4)[source]#

Bases: Connector

The GoogleScholarConnector class is a minifold gateway for querying Google Scholar (https://scholar.google.com/).

See also:

  • DblpConnector

  • HalConnector

Constructor.

Parameters:

citation_format (str) – The citation format.

attributes(object: str) set[source]#

Lists available attributes related to a given collection of object stored in this GoogleScholarConnector instance.

Parameters:

object (str) – The name of the collection of entries. The only supported objects are "publication" and "cluster".

Returns:

The set of available attributes for object.

static filter_to_scholar(p: BinaryPredicate, gs_query: SearchScholarQuery, authors: list)[source]#

Converts a minifold predicate (applying to authors) to the corresponding scholar filter (recursive function).

Parameters:
  • p (BinaryPredicate) – The minifold filter.

  • gs_query (ScholarQuery) – The Google Scholar query.

  • authors (list) – Pass an empty list.

query(query: Query) list[source]#

Handles an input Query instance.

Parameters:

query (Query) – The handled query.

Returns:

The list of entries matching the input Query.

static sanitize_author(authors: list, author: str) str[source]#

Fixes author names by finding, in an input list of strings, the string closest to the input string.

Parameters:
  • authors (list) – The list of strings.

  • author (str) – The reference string.

Returns:

The string of authors closest to author if any, else author.

class MinifoldScholarQuerier[source]#

Bases: ScholarQuerier

MinifoldScholarQuerier overloads ScholarQuerier to fetch more attributes.

Constructor.

parse(s_html: str)[source]#

Populates self.articles using the HTML Google Scholar page.

Parameters:

s_html (str) – An HTML Google Scholar page content.

Raises:

RuntimeError – If the result cannot be fetched from Google Scholar.

send_query(gs_query: ScholarQuery)[source]#

Sends a query to Google Scholar.

Parameters:

gs_query (ScholarQuery) – A Google Scholar query.

Raises:

RuntimeError – If the result cannot be fetched from Google Scholar.

parse_article(s_html: str) dict[source]#

Parse a “gs_res_ccl_mid” div (wrapping each article) returned by Google Scholar.

Parameters:

s_html (str) – The HTML string containing the div.

Returns:

The dict describing the article, structured as follows:

  • "authors" : list, which only contains the first authors. The last author name may be incomplete.

  • "cluster_id" : int,

  • "conference" : str, # May be unset

  • "editor" : str,

  • "excerpt" : str, # May be unset

  • "num_citations" : int,

  • "num_versions" : int,

  • "title" : str,

  • "url_citations" : str,

  • "url_versions" : str,

  • "url_pdf" : str,

  • "url_title" : str,

  • "year" : int.

An incomplete string may start with '…' and end with '…'. URLs are absolute.

minifold.group_by module#

class GroupByConnector(attributes: list, child: Connector)[source]#

Bases: Connector

The GroupByConnector class implements the GROUP BY statement in a minifold pipeline.

Constructor.

Parameters:
  • attributes (list) – The list of entry keys used to form the aggregates.

  • child (Connector) – The child minifold Connector instance.

attributes(object: str) set[source]#

Lists the available attributes related to a given collection of minifold entries exposed by this GroupByConnector instance.

Parameters:

object (str) – The name of the collection.

Returns:

The set of corresponding attributes.

property child: Connector#

Accessor to the child minifold Connector instance.

Returns:

The child minifold Connector instance.

query(q: Query) list[source]#

Handles an input Query instance.

Parameters:

q (Query) – The handled query.

Returns:

The list of entries matching the input query.

group_by(attributes: list, entries: list) dict[source]#

Implements the GROUP BY statement for a list of minifold entries.

Parameters:
  • attributes (list) – The list of entry keys used to form the aggregates.

  • entries (list) – A list of minifold entries.

Returns:

A dictionary where each key identifies an aggregate and is mapped to the corresponding entries.
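The aggregation can be pictured with the sketch below. It assumes that each aggregate is identified by the tuple of values taken by attributes, which the text above does not spell out; group_by_sketch is a hypothetical name.

```python
from collections import defaultdict


def group_by_sketch(attributes: list, entries: list) -> dict:
    groups = defaultdict(list)
    for entry in entries:
        # Assumed key: the tuple of values of the grouping attributes.
        key = tuple(entry.get(attribute) for attribute in attributes)
        groups[key].append(entry)
    return dict(groups)


entries = [
    {"team": "a", "year": 2020, "n": 1},
    {"team": "a", "year": 2020, "n": 2},
    {"team": "b", "year": 2021, "n": 3},
]
groups = group_by_sketch(["team", "year"], entries)
```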

group_by_impl(functor: ValuesFromDictFonctor, entries: list) dict[source]#

Implementation details of group_by().

Parameters:
  • functor (ValuesFromDictFonctor) – The functor allowing to extract the values used to form the aggregates.

  • entries (list) – A list of minifold entries.

Returns:

A dictionary where each key identifies an aggregate and is mapped to the corresponding entries.

minifold.hal module#

The HAL Computer science bibliography provides open bibliographic information on major computer science journals and proceedings.

Some query examples:

  • Retrieves LINCS publications since 2012: <https://api.archives-ouvertes.fr/search/?q=structId_i:(160294)&fq=producedDateY_i:[2012%20TO%20*]>

  • Retrieves articles authored by Fabien Mathieu since 2012: <https://api.archives-ouvertes.fr/search/?q=*:*&fq=authFullName_s:(%22Fabien%20Mathieu%22)&fq=producedDateY_i:[2012%20TO%20*]&&fl=title_s&sort=submittedDate_tdate+desc&wt=json>

  • Retrieves LINCS publications in JSON for a subset of attributes, sorted from the newest to the oldest: <https://api.archives-ouvertes.fr/search/?q=structId_i:(160294)&fq=&rows=999&fl=keyword_s,producedDateY_i,authFullName_s,*itle_s,abstract_s,docType_s&sort=submittedDate_tdate+desc&wt=json>

It is possible to query a specific researcher using their HAL-ID.

class HalConnector(map_hal_id: dict = None, map_hal_name: dict = None, hal_api_url: str = 'https://api.archives-ouvertes.fr/search')[source]#

Bases: Connector

The HalConnector class is a minifold gateway that fetches data from HAL (repository of scientific articles).

See also:

  • DblpConnector

  • GoogleScholarConnector

Constructor.

Parameters:
  • map_hal_id (dict) – A dictionary that maps some researcher names to their corresponding HAL-ID.

  • map_hal_name (dict) – A dictionary that maps some researcher names to their name in HAL.

property api_url: str#

Retrieves the HAL repository URL of this HalConnector instance.

Returns:

The HAL repository URL.

attributes(object: str) set[source]#

Lists available attributes related to a given collection of minifold entries exposed by this Connector instance.

Parameters:

object (str) – The name of the collection.

Returns:

The set of corresponding attributes.

static binary_predicate_to_hal(p: BinaryPredicate) str[source]#

Converts a minifold predicate to the corresponding HAL URL predicate.

Parameters:

p (BinaryPredicate) – The minifold predicate.

Returns:

The corresponding HAL URL predicate.

property format: str#

Retrieves the format of the HAL results of this HalConnector instance.

Returns:

The format of the HAL results (e.g., "json").

property map_hal_id: dict#

Retrieves the dictionary mapping researcher names to the corresponding HAL ID of this HalConnector instance.

Returns:

The dictionary mapping researcher names to the corresponding HAL ID of this HalConnector instance.

property map_hal_name: dict#

Retrieves the dictionary mapping researcher names to the corresponding HAL name of this HalConnector instance.

Returns:

The dictionary mapping researcher names to the corresponding HAL name of this HalConnector instance.

query(q: Query) list[source]#

Handles an input Query instance.

Parameters:

q (Query) – The handled query.

Returns:

The list of entries matching the input query.

query_to_hal(q: Query) str[source]#

Converts a minifold query to a HAL URL query.

Parameters:

q (Query) – The minifold query.

Returns:

The corresponding HAL URL.

static quote(s: str) str[source]#

Quotes a string to encode it in a HAL URL.

Parameters:

s (str) – The string to be quoted.

Returns:

The quoted string.

static sanitize_dict(d: dict) dict[source]#

Reshapes a dictionary obtained from a JSON HAL result.

Parameters:

d (dict) – The JSON HAL dictionary.

Returns:

The reshaped dictionary.

sanitize_entries(entries: list) list[source]#

Reshapes a collection of raw minifold entries related to HAL.

Parameters:

entries (list) – The input minifold entries.

Returns:

The reshaped minifold entries.

sanitize_entry(entry: dict) dict[source]#

Reshapes a raw minifold entry related to HAL.

Parameters:

entry (dict) – The input minifold entry.

Returns:

The reshaped minifold entry.

static string_to_hal(s: str) str[source]#

Converts an arbitrary string to its corresponding HAL URL string.

Parameters:

s (str) – The input string.

Returns:

The converted string.

static to_doc_type(s: str) DocType[source]#

Converts a HAL document type to the corresponding DocType value.

Parameters:

s (str) – The HAL document type.

Returns:

The corresponding DocType value.

minifold.hash module#

Internals to make hashable objects out of objects that are not hashable.

This is useful, e.g., in the UniqueConnector class.

to_hashable(x: object) object[source]#

Converts an object to a hashable tuple.

Parameters:

x (object) – The input object.

Returns:

The corresponding hashable object.
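One way to obtain such a hashable representation is to convert containers recursively, as sketched below. This is an illustration, not the library's code: to_hashable_sketch is a hypothetical name, and the sketch assumes that dictionary keys are mutually comparable (e.g., all strings).

```python
def to_hashable_sketch(x: object) -> object:
    if isinstance(x, dict):
        # Sort the items so that two equal dicts give the same tuple.
        return tuple(sorted((k, to_hashable_sketch(v)) for k, v in x.items()))
    if isinstance(x, (list, tuple)):
        return tuple(to_hashable_sketch(v) for v in x)
    if isinstance(x, set):
        return frozenset(to_hashable_sketch(v) for v in x)
    return x  # assumed to be hashable already (str, int, None, ...)


h = to_hashable_sketch({"b": [1, 2], "a": {"x": 1}})
# Duplicated entries collapse once converted, which is what a
# deduplication step needs.
unique = {to_hashable_sketch(e) for e in [{"a": 1}, {"a": 1}, {"a": 2}]}
```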

minifold.html module#

This file gathers some utilities to process, render, or make HTML strings in minifold.

connector_to_html(connector: Connector, **kwargs) str[source]#
dict_to_html(d: dict, attributes: list, map_attribute_label: dict = None) str[source]#
entries_to_html(entries: list, map_attribute_label: dict = None, attributes: list = None, keep_entry_if: callable = None) str[source]#

Exports to HTML a list of dict.

Parameters:
  • entries – A list of dicts

  • map_attribute_label – A dict{str: str} which maps each entry key with the column header to display.

  • attributes – The subset of keys to display.

  • keep_entry_if – Callback allowing to filter some entries

Returns:

The corresponding HTML string.

entry_to_html(entry: dict, map_attribute_label: dict = None, attributes: list = None) str[source]#
html(s: str)[source]#

Evaluates HTML code in a Jupyter Notebook.

Parameters:

s – A str containing HTML code.

html_to_text(s_html: str, blacklist: set = None) str[source]#

Converts an HTML page to text, by discarding the JavaScript and CSS related to the site.

Parameters:
  • s_html (str) – A str containing HTML.

  • blacklist (set) – A set of strings (lowercase) corresponding to HTML tags that must be ignored.

Returns:

The corresponding text.
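The idea of dropping blacklisted tags while keeping visible text can be sketched with the standard html.parser module. This is only an illustration: html_to_text_sketch and TextExtractorSketch are hypothetical names, the default blacklist is an assumption, and the sketch does not claim to match the parser minifold relies on.

```python
from html.parser import HTMLParser


class TextExtractorSketch(HTMLParser):
    def __init__(self, blacklist: set = None):
        super().__init__()
        # Assumed default: ignore scripts and stylesheets.
        self.blacklist = {"script", "style"} if blacklist is None else blacklist
        self.depth = 0    # > 0 while inside a blacklisted tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.blacklist:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.blacklist and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text_sketch(s_html: str, blacklist: set = None) -> str:
    parser = TextExtractorSketch(blacklist)
    parser.feed(s_html)
    parser.close()
    return " ".join(parser.chunks)


text = html_to_text_sketch(
    "<html><head><style>p{color:red}</style><script>var x = 1;</script></head>"
    "<body><p>Hello</p><p>world</p></body></html>"
)
```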

print_error(x: object)[source]#

Prints an error in a Jupyter Notebook.

Parameters:

x – An exception.

remove_all_attrs_except_saving(soup: BeautifulSoup, whitelist: dict = None)[source]#

Removes all attributes except some.

Parameters:
  • soup (BeautifulSoup) – A BeautifulSoup instance, modified in place.

  • whitelist – A dict{tag: list(attr)} where tag is an HTML tag and attr an HTML attribute.

remove_tags(soup: BeautifulSoup, blacklist: set = None)[source]#

Removes some HTML tags.

Parameters:
  • soup (BeautifulSoup) – A BeautifulSoup instance, modified in place.

  • blacklist (set) – A set of str, where each str is an HTML tag.

sanitize_html(s_html: str, blacklist: set = None, remove_attrs: bool = True) str[source]#

Removes irrelevant HTML blocks and attributes from an HTML string. Warning: This function is SLOW, so do not use it on a large corpus!

Parameters:
  • s_html (str) – A str instance containing HTML.

  • blacklist (set) – The set of blacklisted HTML tags.

  • remove_attrs (bool) – Pass True to remove HTML tag attributes.

Returns:

The sanitized string.

value_to_html(x: object) str[source]#

Converts a value to its corresponding HTML string.

Parameters:

x (object) – The input value.

Returns:

The corresponding HTML string.

minifold.html_table module#

class HtmlTableConnector(filename: str, columns: list, keep_entry: callable = None)[source]#

Bases: EntriesConnector

The HtmlTableConnector class is a minifold gateway allowing to fetch data stored in an HTML table.

Constructor.

Parameters:
  • filename (str) – Input HTML filename.

  • columns (list) – A list of strings mapping each column index to the corresponding attribute name. If data is fetched for columns whose index is beyond those covered by columns, columns[-1] is used, and this key may store a list of string values instead of a single string. This allows storing data spread over several columns in a single attribute.

  • keep_entry (callable) – A callback which determines whether an entry must be kept or discarded. Pass None to filter nothing. This is the opportunity to discard a header or irrelevant rows.

attributes(object: str = None) set[source]#

Lists available attributes related to a given collection of minifold entries exposed by this Connector instance.

Parameters:

object (str) – The name of the collection. As this connector stores a single collection, you may pass None. Defaults to None.

Returns:

The set of corresponding attributes.

class HtmlTableParser(columns: list, output_list: list, keep_entry: callable = None)[source]#

Bases: HTMLParser

The HtmlTableParser class extracts the values from an HTML table.

Constructor.

Parameters:
  • columns (list) – A list of strings mapping each column index to the corresponding attribute name. If data is fetched for columns whose index is beyond those covered by columns, columns[-1] is used, and this key may store a list of string values instead of a single string. This allows storing data spread over several columns in a single attribute.

  • output_list (list) – A reference to an output list where the data will be output (one dict per row, one key/value per column).

  • keep_entry (callable) – A callback which determines whether an entry must be kept or discarded. Pass None to filter nothing. This is the opportunity to discard a header or irrelevant rows.

error(message: str)[source]#

Logs an error message.

Parameters:

message (str) – The error message.

handle_data(data: str)[source]#

Callback that handles HTML data.

Parameters:

data (str) – The HTML data (here, stored in a table cell).

handle_endtag(tag)[source]#

Callback that handles a closing HTML tag (e.g., "</td>").

Parameters:

tag (str) – The HTML tag. Example: "td".

handle_starttag(tag: str, attrs: str)[source]#

Callback that handles an opening HTML tag (e.g., "<td>").

Parameters:
  • tag (str) – The HTML tag. Example: "td".

  • attrs (str) – The HTML tag attributes.

html_table(filename: str, columns: list, keep_entry: bool = None) list[source]#

Loads an HTML table from an input file.

Parameters:
  • filename (str) – The path to the input HTML file.

  • columns (list) – A list of strings mapping each column index to the corresponding attribute name. If data is fetched for columns whose index is beyond those covered by columns, columns[-1] is used, and this key may store a list of string values instead of a single string. This allows storing data spread over several columns in a single attribute.

  • keep_entry (callable) – A callback which determines whether an entry must be kept or discarded. Pass None to filter nothing. This is the opportunity to discard a header or irrelevant rows.

Returns:

The corresponding list of minifold entries.
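The columns convention, in particular the way extra cells fall back on columns[-1], can be illustrated with a small parser built on html.parser. This sketch is an assumption-laden approximation, not the HtmlTableParser source: TableSketchParser is a hypothetical name, keep_entry is omitted, and each cell is assumed to hold plain text only.

```python
from html.parser import HTMLParser


class TableSketchParser(HTMLParser):
    def __init__(self, columns: list, output_list: list):
        super().__init__()
        self.columns = columns
        self.output_list = output_list
        self.row = None       # dict built for the current <tr>
        self.index = -1       # index of the current cell in its row
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row, self.index = {}, -1
        elif tag in ("td", "th") and self.row is not None:
            self.index += 1
            self.in_cell = True

    def handle_data(self, data):
        if not self.in_cell:
            return
        value = data.strip()
        # Cells beyond the declared columns are stored under columns[-1].
        key = self.columns[min(self.index, len(self.columns) - 1)]
        if key not in self.row:
            self.row[key] = value
        elif isinstance(self.row[key], list):
            self.row[key].append(value)
        else:
            self.row[key] = [self.row[key], value]

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
        elif tag == "tr" and self.row is not None:
            self.output_list.append(self.row)
            self.row = None


rows = []
parser = TableSketchParser(["name", "email", "phone"], rows)
parser.feed(
    "<table><tr><td>Alice</td><td>a@x.org</td><td>+33</td><td>+44</td></tr>"
    "<tr><td>Bob</td><td>b@x.org</td><td>+1</td></tr></table>"
)
parser.close()
```

In this sketch the first row stores a list under "phone" because it has four cells for three declared columns, while the second row stores a single string.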

minifold.ipynb module#

This file gathers utilities related to IPython and Jupyter notebooks.

in_ipynb() bool[source]#

Tests whether the code is running inside a Jupyter Notebook.

Returns:

True iff the code is running inside a Jupyter Notebook.

minifold.join_if module#

class JoinIfConnector(left: Connector, right: Connector, join_if: callable, mode: int = 1)[source]#

Bases: Connector

The JoinIfConnector is a minifold connector that implements the INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL OUTER JOIN statements in a minifold pipeline.

Constructor.

Parameters:
  • left (Connector) – The left Connector child.

  • right (Connector) – The right Connector child.

  • join_if (callable) – The callback implementing the join criterion.

  • mode (int) – The type of join. The valid values are: INNER_JOIN, LEFT_JOIN, RIGHT_JOIN, FULL_OUTER_JOIN.

attributes(object: str)[source]#

Lists the available attributes related to a given collection of minifold entries exposed by this JoinIfConnector instance.

Parameters:

object (str) – The name of the collection.

Returns:

The set of corresponding attributes.

property join_if: callable#

Retrieves the functor to join entries in this JoinIfConnector instance.

Returns:

The functor used to join entries.

property left: Connector#

Retrieves the left Connector child in this JoinIfConnector instance.

Returns:

The left Connector child.

property mode: int#

Retrieves the join mode used in this JoinIfConnector instance.

Returns:

A value among:

  • INNER_JOIN

  • LEFT_JOIN

  • RIGHT_JOIN

  • FULL_OUTER_JOIN

query(query: Query) list[source]#

Handles an input Query instance.

Parameters:

query (Query) – The handled query.

Raises:

ValueError

Returns:

The list of entries matching the input query.

property right: Connector#

Retrieves the right Connector child in this JoinIfConnector instance.

Returns:

The right Connector child.

are_joined_if(l_entry: dict, r_entry: dict, f: callable) bool[source]#

Internal function, used to check whether two dictionaries can be joined according to a functor.

Parameters:
  • l_entry (dict) – The dictionary corresponding to the left operand.

  • r_entry (dict) – The dictionary corresponding to the right operand.

  • f (callable) – A functor such that f(l, r) returns True if and only if l and r can be joined, False otherwise.

Raises:

RuntimeError – on keys that are missing in l or r.

Returns:

True if and only if l and r can be joined, False otherwise.

full_outer_join_if(l_entries: list, r_entries: list, f: callable, match_once: bool = True) list[source]#

Computes the FULL OUTER JOIN of two lists of minifold entries.

Parameters:
  • l_entries (list) – The minifold entries corresponding to the left operand.

  • r_entries (list) – The minifold entries corresponding to the right operand.

  • f (callable) – A functor such that f(l, r) returns True if and only if l and r can be joined (where l and r are two minifold entries), False otherwise.

  • match_once (bool) – Pass True if a left entry must be matched at most once.

Returns:

The corresponding list of entries.

inner_join_if(l_entries: list, r_entries: list, f: callable, match_once: bool = True, merge: callable = <function merge_dict>) list[source]#

Computes the INNER JOIN of two lists of minifold entries.

Parameters:
  • l_entries (list) – The minifold entries corresponding to the left operand.

  • r_entries (list) – The minifold entries corresponding to the right operand.

  • f (callable) – A functor such that f(l, r) returns True if and only if l and r can be joined (where l and r are two minifold entries), False otherwise.

  • match_once (bool) – Pass True if a left entry must be matched at most once.

  • merge (callable) – A function that merges two input dictionaries. Defaults to merge_dict().

Returns:

The corresponding list of entries.

join_mode_to_string(mode: int) str[source]#

Converts a *_JOIN constant to the corresponding string representation.

Parameters:

mode (int) – A value among INNER_JOIN, LEFT_JOIN, RIGHT_JOIN, FULL_OUTER_JOIN.

left_join_if(l_entries: list, r_entries: list, f: callable, match_once: bool = True, merge: callable = <function merge_dict>) list[source]#

Computes the LEFT JOIN of two lists of minifold entries.

Parameters:
  • l_entries (list) – The minifold entries corresponding to the left operand.

  • r_entries (list) – The minifold entries corresponding to the right operand.

  • f (callable) – A functor such that f(l, r) returns True if and only if l and r can be joined (where l and r are two minifold entries), False otherwise.

  • match_once (bool) – Pass True if a left entry must be matched at most once.

  • merge (callable) – A function that merges two input dictionaries. Defaults to merge_dict().

Returns:

The corresponding list of entries.
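The difference between the inner and the left join, and the effect of match_once, can be sketched as follows. All names ending in _sketch are hypothetical, and the handling of unmatched left entries (kept as-is, without padding the missing right keys) is an assumption.

```python
def merge_dict_sketch(l_entry: dict, r_entry: dict) -> dict:
    ret = dict(l_entry)
    ret.update(r_entry)  # right values win on key conflicts
    return ret


def join_if_sketch(l_entries, r_entries, f, match_once=True, keep_unmatched=False):
    ret = []
    for l in l_entries:
        joined = False
        for r in r_entries:
            if f(l, r):
                ret.append(merge_dict_sketch(l, r))
                joined = True
                if match_once:
                    break  # a left entry is matched at most once
        if keep_unmatched and not joined:
            ret.append(dict(l))
    return ret


def inner_join_if_sketch(l_entries, r_entries, f, match_once=True):
    return join_if_sketch(l_entries, r_entries, f, match_once)


def left_join_if_sketch(l_entries, r_entries, f, match_once=True):
    return join_if_sketch(l_entries, r_entries, f, match_once, keep_unmatched=True)


left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"id": 1, "city": "x"}, {"id": 1, "city": "y"}]
same_id = lambda l, r: l["id"] == r["id"]

inner = inner_join_if_sketch(left, right, same_id)
inner_all = inner_join_if_sketch(left, right, same_id, match_once=False)
left_joined = left_join_if_sketch(left, right, same_id)
```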

merge_dict(l_entry: dict, r_entry: dict) dict[source]#

Merges two dictionaries. The input dictionaries are not altered.

Example

>>> merge_dict({"a": 1, "b": 2}, {"b": 3, "c": 4})
{'a': 1, 'b': 3, 'c': 4}
Parameters:
  • l_entry (dict) – A dictionary.

  • r_entry (dict) – Another dictionary.

Returns:

The dictionary obtained by merging l_entry and r_entry.

right_join_if(l_entries: list, r_entries: list, f: callable, match_once: bool = True) list[source]#

Computes the RIGHT JOIN of two lists of minifold entries.

Parameters:
  • l_entries (list) – The minifold entries corresponding to the left operand.

  • r_entries (list) – The minifold entries corresponding to the right operand.

  • f (callable) – A functor such that f(l, r) returns True if and only if l and r can be joined (where l and r are two minifold entries), False otherwise.

  • match_once (bool) – Pass True if a left entry must be matched at most once.

Returns:

The corresponding list of entries.

minifold.json module#

class JsonConnector(json_data: str, extract_json: callable = <function identity>)[source]#

Bases: EntriesConnector

The JsonConnector class is a gateway to a JSON string.

Constructor.

Parameters:
  • json_data – A JSON string.

  • extract_json – A function that converts the JSON data to a list of minifold entries. Defaults to identity().

class JsonFileConnector(json_filename: str, extract_json: callable = <function identity>)[source]#

Bases: JsonConnector

The JsonFileConnector class is a gateway to a JSON file.

Constructor.

Parameters:
  • json_filename – The path of the input JSON file.

  • extract_json – A function that converts the JSON data to a list of minifold entries. Defaults to identity().

identity(x: object) list[source]#

Identity function.

Parameters:

x (object) – An arbitrary object.

Returns:

The x object.

load_json_from_file(json_filename: str, extract_json: callable = <function identity>) list[source]#

Loads minifold entries from a JSON file.

Parameters:
  • json_filename (str) – The path to a JSON file.

  • extract_json (callable) – A function that reprocesses the python structure resulting from the JSON to produce a list of minifold entries. Defaults to identity().

Returns:

A list of minifold entries.

load_json_from_str(json_data: str, extract_json: callable = <function identity>) list[source]#

Loads minifold entries from a JSON string.

Parameters:
  • json_data (str) – The JSON data.

  • extract_json (callable) – A function that reprocesses the python structure resulting from the JSON to produce a list of minifold entries. Defaults to identity().

Returns:

A list of minifold entries.
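The role of extract_json is easiest to see when the JSON document wraps the entries in an envelope. The sketch below assumes such an envelope (the "results" key is made up for the example) and uses the hypothetical name load_json_from_str_sketch.

```python
import json


def load_json_from_str_sketch(json_data: str, extract_json=lambda x: x) -> list:
    # Parse first, then let the callback dig out the list of entries.
    return extract_json(json.loads(json_data))


raw = '{"results": [{"a": 1}, {"a": 2}]}'
entries = load_json_from_str_sketch(raw, extract_json=lambda d: d["results"])
```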

minifold.lambdas module#

class LambdasConnector(map_lambdas: dict, child: Connector, map_dependencies: dict = None)[source]#

Bases: Connector

The LambdasConnector class is used to apply the lambdas() function in the middle of a minifold pipeline. It allows crafting or reshaping a flow of minifold entries on the fly.

Constructor.

Parameters:
  • map_lambdas (dict) – A dictionary that maps each attribute key (existing or new) to a function processing an input entry.

  • child (Connector) – The child minifold Connector instance.

attributes(object: str) set[source]#

Lists the available attributes related to a given collection of minifold entries exposed by this LambdasConnector instance.

Parameters:

object (str) – The name of the collection.

Returns:

The set of corresponding attributes.

property child: Connector#

Accessor to the child minifold Connector instance.

Returns:

The child minifold Connector instance.

find_needed_attributes(attributes: set) set[source]#

Finds the attributes that are needed to process a set of attributes of interest.

Parameters:

attributes (set) – The attributes of interest.

Returns:

The corresponding needed attributes.

query(q: Query) list[source]#

Handles an input Query instance.

Parameters:

q (Query) – The handled query.

Returns:

The list of entries matching the input query.

find_lambda_dependencies(func: callable) set[source]#

Infers the keys needed by a function processing a dictionary so that no KeyError exception is raised.

Parameters:

func (callable) – A function taking a dictionary in parameter.

Returns:

The keys needed by func to process a dictionary.

find_lambdas_dependencies(map_lambdas: dict) dict[source]#

Infers the keys needed by several functions, each of which outputs a specific entry key when processing an input entry, so that no KeyError exception is raised.

Parameters:

map_lambdas (dict) – A dictionary that maps each attribute key (existing or new) to a function processing an input entry.

Returns:

The keys needed by the functions involved in map_lambdas.values() to process a minifold entry.

lambdas(map_lambdas: dict, entries: list, attributes: set = None) list[source]#

Make sure that the result is deterministic regardless of the order in which the lambdas are processed.

Examples

>>> map_lambdas = {"x": lambda e: 10 + e["x"]} # OK
>>> map_lambdas = {"x": lambda e: 10 + e["x"] + e["y"]} # OK
>>> map_lambdas = {
...     "x": lambda e: 10 + e["x"] + e["y"],
...     "y": lambda e: 10 + e["y"]
... } # not OK because e["y"] is ambiguous.
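The ambiguity of the last example can be reproduced with a few lines. The sketch assumes that the lambdas are applied one after another to the same dictionary, which is an assumption about the evaluation model rather than a documented fact; apply_in_order is a hypothetical helper.

```python
def apply_in_order(map_lambdas: dict, entry: dict, order: list) -> dict:
    entry = dict(entry)  # work on a copy
    for key in order:
        # Each lambda sees the values written by the previous ones.
        entry[key] = map_lambdas[key](entry)
    return entry


map_lambdas = {
    "x": lambda e: 10 + e["x"] + e["y"],
    "y": lambda e: 10 + e["y"],
}
entry = {"x": 1, "y": 2}

x_first = apply_in_order(map_lambdas, entry, ["x", "y"])
y_first = apply_in_order(map_lambdas, entry, ["y", "x"])
```

Under this assumption the two orders disagree on "x", since the "x" lambda reads either the original or the updated value of "y".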

minifold.ldap module#

class LdapConnector(ldap_host: str, ldap_user: str = None, ldap_password: str = None, ldap_use_ssl: bool = None)[source]#

Bases: Connector

Constructor.

Parameters:
  • ldap_host (str) – The FQDN or the IP of the server (e.g., “my-ldap.firm.com”).

  • ldap_user (str) – The LDAP login used to connect to the server.

  • ldap_password (str) – The LDAP password of ldap_user.

  • ldap_use_ssl (bool) – Pass True if the connection to the server must be established using SSL, False or None otherwise.

attributes(object: str) set[source]#

Lists the available attributes related to a given collection of minifold entries exposed by this LdapConnector instance.

Parameters:

object (str) – The name of the collection.

Returns:

The set of corresponding attributes.

static binary_predicate_to_ldap(p: BinaryPredicate) str[source]#

Converts a BinaryPredicate to the corresponding LDAP predicate.

Parameters:

p (BinaryPredicate) – A BinaryPredicate instance.

Returns:

The corresponding LDAP predicate.

static literal_from_ldap(b: bytes) str[source]#

Converts an LDAP literal to the corresponding string.

Parameters:

b (bytes) – The input literal.

Returns:

The corresponding string.

static operand_to_ldap(operand: str) str[source]#

Converts a minifold operand to the corresponding LDAP operand.

Parameters:

operand (str) – The operand to be converted.

Returns:

The corresponding string.

static operator_to_ldap(op) str[source]#

Converts a minifold operator to the corresponding LDAP operator.

Parameters:

op – The operator to be converted.

Returns:

The corresponding LDAP string.

query(q: Query) list[source]#

Handles an input Query instance.

Parameters:

q (Query) – The handled query.

Returns:

The list of entries matching the input query.

static sanitize_dict(d: dict) dict[source]#

Reshapes the dictionary returned by an LDAP query to a minifold entry.

Parameters:

d (dict) – The input dict.

Returns:

The corresponding sanitized dictionary.

minifold.lexical_cast module#

cast_bool(s: str) bool[source]#

Casts a string to a bool if possible, raises an exception otherwise.

Parameters:

s (str) – The string to be cast.

Raises:

ValueError – if the cast cannot be achieved.

Returns:

The boolean corresponding to s if successful.

cast_none(s: str) None[source]#

Casts a string to None if possible, raises an exception otherwise.

Parameters:

s (str) – The string to be cast.

Raises:

ValueError – if the cast cannot be achieved.

Returns:

None if successful.

lexical_cast(s: str, cast: callable) object[source]#

Casts a string according to an operator. See also the lexical_casts() function.

Parameters:
  • s (str) – The string to be cast.

  • cast (callable) – A single cast operator. Examples: cast_bool(), cast_none(), int(), float(), etc.

Raises:

ValueError – if the cast cannot be achieved.

Returns:

The corresponding value.

lexical_casts(s: str, cast_operators: list = None) object[source]#

Casts a string according to several cast operators. See also the lexical_cast() function.

Parameters:
  • s (str) – The string to be cast.

  • cast_operators (list) – A list of cast operators. Operators must be ordered by decreasing strictness (e.g., int should precede float). Pass None to use the default list of cast operators.

Returns:

The original string if no cast worked, the corresponding casted value otherwise.
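
The ordering rule for cast_operators can be illustrated with a minimal sketch. It is not minifold's implementation: lexical_casts_sketch and cast_bool_sketch are hypothetical names, and both the strings accepted as booleans and the default operator list are assumptions.

```python
def cast_bool_sketch(s: str) -> bool:
    """Hypothetical stand-in for cast_bool(): accepts only "True" or "False"."""
    if s == "True":
        return True
    if s == "False":
        return False
    raise ValueError(f"cannot cast {s!r} to bool")


def lexical_casts_sketch(s: str, cast_operators: list = None) -> object:
    """Return the first successful cast of s, or s itself if every cast fails."""
    if cast_operators is None:
        # Assumed default, strictest operator first.
        cast_operators = [cast_bool_sketch, int, float]
    for cast in cast_operators:
        try:
            return cast(s)
        except ValueError:
            continue
    return s


print(lexical_casts_sketch("42"))
print(lexical_casts_sketch("4.2"))
print(lexical_casts_sketch("foo"))
# With float before int, the string "42" is cast to a float instead of an int.
print(lexical_casts_sketch("42", [float, int]))
```

Because every string accepted by int() is also accepted by float(), placing float first would prevent int from ever being tried.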

minifold.limit module#

class LimitConnector(child, lim: int)[source]#

Bases: Connector

The LimitConnector class implements the LIMIT statement in a minifold pipeline.

Constructor.

Parameters:
  • child (Connector) – The child minifold Connector instance.

  • lim (int) – A positive integer, limiting the number of entries to return. Pass None if there is no limit.

attributes(object: str) set[source]#

Lists the available attributes related to a given collection of minifold entries exposed by this LimitConnector instance.

Parameters:

object (str) – The name of the collection.

Returns:

The set of corresponding attributes.

property child#

Accessor to the child minifold Connector instance.

Returns:

The child minifold Connector instance.

property limit: int#

Accessor to the limit size.

Returns:

The limit size or None (no limit).

query(q: Query) list[source]#

Handles an input Query instance.

Parameters:

q (Query) – The handled query.

Returns:

The list of entries matching the input query.

limit(entries: list, lim: int) list[source]#

Implements the LIMIT statement for a list of minifold entries.

Example

>>> limit([{"a": 1, "b": 1}, {"a": 2, "b": 2}, {"a": 3, "b": 3}], 2)
[{'a': 1, 'b': 1}, {'a': 2, 'b': 2}]
Parameters:
  • entries (list) – A list of minifold entries.

  • lim (int) – A positive integer, limiting the number of entries to return. Pass None if there is no limit.

Returns:

The kept entries.

minifold.log module#

class Log[source]#

Bases: object

The Log enables logging in Minifold.

Example

>>> Log.enable_print = True
>>> Log.info("hello")
classmethod debug(s: str)[source]#

Logs a debug message.

Parameters:
  • cls (Log) – The Log class.

  • s (str) – The string to log.

static default_style() str[source]#

Crafts a shell escape sequence to set the default style

enable_print = False#
classmethod error(s: str)[source]#

Logs an error message.

Parameters:
  • cls (Log) – The Log class.

  • s (str) – The string to log.

classmethod info(s: str)[source]#

Logs an info message.

Parameters:
  • cls (Log) – The Log class.

  • s (str) – The string to log.

log_level = 0#
message_color = {0: 6, 1: 2, 2: 3, 3: 1}#
message_header = {0: 'DEBUG', 1: 'INFO', 2: 'WARNING', 3: 'ERROR'}#
classmethod print(message_type: int, message: str, file=sys.stderr)[source]#

Internal method, used to print a message with a custom header and style.

Parameters:
  • message_type (int) – A value in INFO, DEBUG, WARNING, ERROR.

  • message (str) – The message to log.

  • file – The output stream. Defaults to sys.stderr.

static start_style(fg_color: int = None, bg_color: int = None, styles: list = None) str[source]#

Crafts a shell escape sequence to start a style.

Parameters:
  • fg_color (int) – The integer identifying the foreground color to be used. Example: GREEN.

  • bg_color (int) – The integer identifying the background color to be used. Example: PINK.

  • styles (list) – Example: [UNDERLINED, BLINKING].

Returns:

The corresponding shell string.

classmethod warning(s: str)[source]#

Logs a warning message.

Parameters:
  • cls (Log) – The Log class.

  • s (str) – The string to log.

with_color = True#

minifold.mongo module#

class MongoConnector(mongo_url: str, db_name: str)[source]#

Bases: Connector

The MongoConnector is a minifold gateway allowing to manipulate data stored in a Mongo database.

Constructor.

Parameters:
  • mongo_url (str) – The URL of the mongo database.

  • db_name (str) – The name of the queried database.

attributes(obj: str = None) set[source]#

Lists the available attributes related to a given collection of minifold entries exposed by this MongoConnector instance.

Parameters:

obj (str) – The name of the collection.

Returns:

The set of corresponding attributes.

connect(mongo_url: str) MongoClient[source]#

Connects to a Mongo database.

Parameters:

mongo_url (str) – The URL of the mongo database.

Returns:

The corresponding MongoClient instance.

query(query: Query) list[source]#

Handles an input Query instance.

Parameters:

query (Query) – The handled query.

Returns:

The list of entries matching the input query.

use_database(db_name: str)[source]#

Selects the active database.

Parameters:

db_name (str) – The name of the queried database.

minifold.natural_join module#

class NaturalJoinConnector(left: Connector, right: Connector)[source]#

Bases: Connector

The NaturalJoinConnector is a minifold connector that implements the NATURAL JOIN statement in a minifold pipeline.

Constructor.

Parameters:
  • left (Connector) – The left Connector child.

  • right (Connector) – The right Connector child.

attributes(object: str) set[source]#

Lists the available attributes related to a given collection of minifold entries exposed by this NaturalJoinConnector instance.

Parameters:

object (str) – The name of the collection.

Returns:

The set of corresponding attributes.

property left: Connector#

Retrieves the left Connector child in this NaturalJoinConnector instance.

Returns:

The left Connector child.

query(query: Query) list[source]#

Handles an input Query instance.

Parameters:

query (Query) – The handled query.

Returns:

The list of entries matching the input query.

property right: Connector#

Retrieves the right Connector child in this NaturalJoinConnector instance.

Returns:

The right Connector child.

are_naturally_joined(l_entry: dict, r_entry: dict) bool[source]#

Internal function, used to check whether two dictionaries can be joined using a NATURAL JOIN statement.

Parameters:
  • l_entry (dict) – The dictionary corresponding to the left operand.

  • r_entry (dict) – The dictionary corresponding to the right operand.

Returns:

True if and only if l_entry and r_entry can be joined, False otherwise (i.e., l_entry and r_entry have no common key).

natural_join(l_entries: list, r_entries: list) list[source]#

Computes the NATURAL JOIN of two lists of minifold entries.

Parameters:
  • l_entries (list) – The minifold entries corresponding to the left operand.

  • r_entries (list) – The minifold entries corresponding to the right operand.

  • f (callable) – A functor such that f(l, r) returns True if and only if l and r can be joined (where l and r are two minifold entries), False otherwise.

  • match_once (bool) – Pass True if a left entry must be matched at most once.

  • merge (callable) – A function that merges two input dictionaries. Defaults to merge_dict().

Returns:

The corresponding list of entries.
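
The join described above can be illustrated with a minimal sketch. It is not minifold's implementation: the function names are hypothetical, and the sketch assumes the usual SQL semantics, where two entries are joined when they share at least one key and agree on the value of every shared key.

```python
def are_naturally_joined_sketch(l_entry: dict, r_entry: dict) -> bool:
    """Check whether two entries share keys and agree on all of them."""
    common_keys = l_entry.keys() & r_entry.keys()
    # No common key means the entries cannot be joined.
    return bool(common_keys) and all(
        l_entry[key] == r_entry[key] for key in common_keys
    )


def natural_join_sketch(l_entries: list, r_entries: list) -> list:
    """Merge every pair of left and right entries that can be joined."""
    return [
        {**l_entry, **r_entry}
        for l_entry in l_entries
        for r_entry in r_entries
        if are_naturally_joined_sketch(l_entry, r_entry)
    ]


people = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
ages = [{"id": 1, "age": 36}, {"id": 3, "age": 41}]
print(natural_join_sketch(people, ages))
```

This nested loop is quadratic in the number of entries, which is acceptable for small lists; indexing one side by its shared key values would scale better.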

minifold.proxy module#

class Proxy(*args, **kwargs)[source]#

Bases: dict

The Proxy class is a singleton dictionary storing proxy settings. It typically maps each protocol (e.g., “http”, “https”) with the corresponding proxy URL (e.g., “http://localhost:8080”).

>>> proxy = Proxy()
>>> proxy_enable("localhost", 8080)
>>> proxy_disable()
make_session() Session[source]#

Creates a requests.Session instance according to the Proxy singleton.

Returns:

The corresponding requests.Session instance.

proxy_disable()[source]#

Disables the Proxy singleton.

proxy_enable(host: str, port: int, protocols: list = None)[source]#

Enables the Proxy singleton.

Parameters:
  • host (str) – The proxy FQDN or IP address.

  • port (int) – The proxy port.

  • protocols (list) – The list of protocols supported by the proxy. Passing None is equivalent to passing ["http", "https"]. Defaults to None.

proxy_enable_localhost()[source]#

Illustrative function enabling a local proxy (127.0.0.1:8080).

minifold.query module#

class Query(action: int = 1, object: str = '', attributes: list = None, filters: object = None, offset: int = None, limit: int = None, sort_by: dict = None)[source]#

Bases: object

Constructor.

Parameters:
  • action (int) – A value in ACTION_CREATE (in SQL, INSERT queries), ACTION_READ (in SQL, SELECT queries), ACTION_UPDATE (in SQL, UPDATE queries), ACTION_DELETE (in SQL, DELETE queries) or None. Passing None is equivalent to ACTION_CREATE. Defaults to None.

  • object (str) – The queried object collection (some minifold gateways may host several entries collections, identified by a name). Defaults to "".

  • attributes (list) – A list of attributes. In SQL, this corresponds to the attributes passed to SELECT.

  • filters (object) – A minifold filter. In SQL, this corresponds to the WHERE clause. See also the BinaryPredicate and the SearchFilter classes.

  • offset (int) – A positive integer or None if not needed. In SQL this corresponds to the OFFSET statement.

  • limit (int) – A positive integer or None if not needed. In SQL this corresponds to the LIMIT statement.

  • sort_by (dict) – A dictionary characterizing how to sort the results. It maps each attribute to be sorted with the corresponding sorting order (SORT_ASC or SORT_DESC). In SQL this corresponds to the ORDER BY clause.

property action: int#
property attributes: list#
copy()[source]#

Copies this Query instance.

Returns:

The copied Query instance.

property filters#
property limit: int#
property object: str#
property offset: int#
property sort_by: list#
action_to_str(action: int) str[source]#

Converts a Query action to its corresponding string.

Parameters:

action (int) – A value in ACTION_CREATE, ACTION_READ, ACTION_UPDATE, ACTION_DELETE.

Returns:

The corresponding string.

minifold.rename module#

class RenameConnector(mapping: dict = None, child: Connector = None)[source]#

Bases: Connector

The RenameConnector class wraps the rename() function to exploit it in a minifold query plan.

Constructor.

Parameters:
  • mapping (dict) – A dictionary mapping each key to be replaced by the new corresponding key (overlying to underlying ontology). Note that mapping must be reversible, i.e. reverse_dict(reverse_dict(d)) == d.

  • child (Connector) – The child minifold Connector instance.

attributes(object: str) set[source]#

Lists the available attributes related to a given collection of minifold entries exposed by this RenameConnector instance.

Parameters:

object (str) – The name of the collection.

Returns:

The set of corresponding attributes.

property child#

Accessor to the child minifold Connector instance.

Returns:

The child minifold Connector instance.

property map_qr#

Accessor to the renaming mapping (overlying to underlying).

Returns:

The corresponding mapping.

property map_rq#

Accessor to the renaming mapping (underlying to overlying).

Returns:

The corresponding mapping.

query(q: Query) list[source]#

Handles an input Query instance.

Parameters:

q (Query) – The handled query.

Returns:

The list of entries matching the input query.

rename(mapping: dict, entries: list) list[source]#

Replaces several keys (possibly) involved in a list of minifold entries.

Example

>>> rename({"a": "A", "b": "B"}, [{"a": 1, "b": 2, "c": 3}, {"a": 10, "b": 20, "c": 30}])
[{'c': 3, 'A': 1, 'B': 2}, {'c': 30, 'A': 10, 'B': 20}]
Parameters:
  • mapping (dict) – A dictionary mapping each key to be replaced by the new corresponding key.

  • entries (list) – A list of minifold entries, updated in place.

Returns:

The updated entries.

rename_entry(d: dict, mapping: dict) dict[source]#

Replaces several keys (possibly) involved in a dictionary.

Example

>>> rename_entry({"a": 1, "b": 2, "c": 3}, {"a": "A", "b": "B"})
{'c': 3, 'A': 1, 'B': 2}
Parameters:
  • d (dict) – The input dictionary, updated in place.

  • mapping (dict) – A dictionary mapping each key to be replaced by the new corresponding key.

Returns:

The updated dictionary.

rename_filters(filters: object, mapping: dict)[source]#

Renames the keys involved in a list of minifold filters (see BinaryPredicate), typically the WHERE part of a Query instance. Recursive function. See the rename_query() function.

Parameters:
  • filters (object) – The minifold filters, updated in place.

  • mapping (dict) – A dictionary mapping each key to be replaced by the new corresponding key.

Returns:

The updated filters.

rename_key(d: dict, old_key: str, new_key: str) dict[source]#

Replaces a key (possibly) involved in a dictionary.

Example

>>> rename_key({"a": 1, "b": 2, "c": 3}, "a", "A")
{'b': 2, 'c': 3, 'A': 1}
Parameters:
  • d (dict) – The input dictionary, updated in place.

  • old_key (str) – The key to be updated.

  • new_key (str) – The new key.

Returns:

The updated dictionary.
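
The in-place renaming described above can be illustrated with a minimal sketch. It is not minifold's implementation: rename_key_sketch and rename_entry_sketch are hypothetical names, and ignoring keys that are absent from the dictionary is an assumption.

```python
def rename_key_sketch(d: dict, old_key: str, new_key: str) -> dict:
    """Rename old_key to new_key in d, updating d in place."""
    if old_key in d:
        # pop() removes the old key; the value is re-inserted under the
        # new key, which therefore moves to the end of the dictionary.
        d[new_key] = d.pop(old_key)
    return d


def rename_entry_sketch(d: dict, mapping: dict) -> dict:
    """Apply every (old_key, new_key) pair of mapping to d, in place."""
    for old_key, new_key in mapping.items():
        rename_key_sketch(d, old_key, new_key)
    return d


entry = {"a": 1, "b": 2, "c": 3}
print(rename_entry_sketch(entry, {"a": "A", "b": "B"}))
```

Removing and re-inserting each key explains why renamed keys appear after the untouched ones in the outputs shown in the examples of this module.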

rename_list(values: list, mapping: dict) list[source]#

Replaces some values involved in a list.

Example

>>> rename_list(["a", "b", "a", "c"], {"a": "A", "b": "B"})
['A', 'B', 'A', 'c']
Parameters:
  • values (list) – The input list, modified in place.

  • mapping (dict) – A dictionary mapping each value to be replaced by the new corresponding value.

Returns:

The updated list.

rename_query(q: Query, mapping: dict) Query[source]#

Renames some attributes involved in a Query instance. This is especially useful when a minifold pipeline involves several data sources using different naming conventions for the same attribute.

Parameters:
  • q (Query) – The Query instance to be renamed.

  • mapping (dict) – A dictionary mapping each key to be replaced by the new corresponding key.

Returns:

The renamed minifold query.

rename_sort_by(sort_by: dict, mapping: dict) dict[source]#

Renames the SORT BY part of a Query instance. See the rename_query() function.

Parameters:
  • sort_by (dict) – The SORT BY part of a Query instance.

  • mapping (dict) – A dictionary mapping each key to be replaced by the new corresponding key.

Returns:

The dictionary corresponding to sort_by after renaming.

minifold.request_cache module#

install_cache(cache_filename: str = None)[source]#

Enables requests_cache for minifold, so that the HTTP queries issued by minifold are cached.

Parameters:

cache_filename (str) – The path to the minifold cache. You may pass None to use the default path (i.e., ~/.minifold/cache/requests_cache under Linux).

minifold.scholar module#

This module provides classes for querying Google Scholar and parsing returned results. It currently only processes the first results page. It is not a recursive crawler.

class ClusterScholarQuery(cluster=None)[source]#

Bases: ScholarQuery

This version just pulls up an article cluster whose ID we already know about.

SCHOLAR_CLUSTER_URL = 'http://scholar.google.com/scholar?cluster=%(cluster)s%(num)s'#
get_url()[source]#

Returns a complete, submittable URL string for this particular query instance. The URL and its arguments will vary depending on the query.

set_cluster(cluster)[source]#

Sets search to a Google Scholar results cluster ID.

exception Error[source]#

Bases: Exception

Base class for any Scholar error.

exception FormatError[source]#

Bases: Error

A query argument or setting was formatted incorrectly.

exception QueryArgumentError[source]#

Bases: Error

A query did not have a suitable set of arguments.

class ScholarArticle[source]#

Bases: object

A class representing articles listed on Google Scholar. The class provides basic dictionary-like behavior.

as_citation()[source]#

Reports the article in a standard citation format. This works only if you have configured the querier to retrieve a particular citation export format. (See ScholarSettings.)

as_csv(header=False, sep='|')[source]#
as_txt()[source]#
set_citation_data(citation_data)[source]#
class ScholarArticleParser(site=None)[source]#

Bases: object

ScholarArticleParser can parse HTML document strings obtained from Google Scholar. This is a base class; concrete implementations adapting to tweaks made by Google over time follow below.

handle_article(art)[source]#

The parser invokes this callback on each article parsed successfully. In this base class, the callback does nothing.

handle_num_results(num_results)[source]#

The parser invokes this callback if it determines the overall number of results, as reported on the parsed results page. The base class implementation does nothing.

parse(html)[source]#

This method initiates parsing of HTML content, cleans resulting content as needed, and notifies the parser instance of resulting instances via the handle_article callback.

class ScholarArticleParser120726(site=None)[source]#

Bases: ScholarArticleParser

This class reflects update to the Scholar results page layout that Google made 07/26/12.

class ScholarConf[source]#

Bases: object

Helper class for global settings.

COOKIE_JAR_FILE = None#
LOG_LEVEL = 1#
MAX_PAGE_RESULTS = 10#
SCHOLAR_SITE = 'http://scholar.google.com'#
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:27.0) Gecko/20100101 Firefox/27.0'#
VERSION = '2.10'#
class ScholarQuerier[source]#

Bases: object

ScholarQuerier instances can conduct a search on Google Scholar with subsequent parsing of the resulting HTML content. The articles found are collected in the articles member, a list of ScholarArticle instances.

GET_SETTINGS_URL = 'http://scholar.google.com/scholar_settings?sciifh=1&hl=en&as_sdt=0,5'#
class Parser(querier)[source]#

Bases: ScholarArticleParser120726

handle_article(art)[source]#

The parser invokes this callback on each article parsed successfully. In this base class, the callback does nothing.

handle_num_results(num_results)[source]#

The parser invokes this callback if it determines the overall number of results, as reported on the parsed results page. The base class implementation does nothing.

SET_SETTINGS_URL = 'http://scholar.google.com/scholar_setprefs?q=&scisig=%(scisig)s&inststart=0&as_sdt=1,5&as_sdtp=&num=%(num)s&scis=%(scis)s%(scisf)s&hl=en&lang=all&instq=&inst=569367360547434339&save='#
add_article(art)[source]#
apply_settings(settings)[source]#

Applies settings as provided by a ScholarSettings instance.

clear_articles()[source]#

Clears any existing articles stored from previous queries.

get_citation_data(article)[source]#

Given an article, retrieves citation link. Note, this requires that you adjusted the settings to tell Google Scholar to actually provide this information, prior to retrieving the article.

parse(html)[source]#

This method allows parsing of provided HTML content.

save_cookies()[source]#

This stores the latest cookies we’re using to disk, for reuse in a later session.

send_query(query)[source]#

This method initiates a search query (a ScholarQuery instance) with subsequent parsing of the response.

class ScholarQuery[source]#

Bases: object

The base class for any kind of results query we send to Scholar.

get_url()[source]#

Returns a complete, submittable URL string for this particular query instance. The URL and its arguments will vary depending on the query.

set_num_page_results(num_page_results)[source]#
class ScholarSettings[source]#

Bases: object

This class lets you adjust the Scholar settings for your session. It’s intended to mirror the features tunable in the Scholar Settings pane, but right now it’s a bit basic.

CITFORM_BIBTEX = 4#
CITFORM_ENDNOTE = 3#
CITFORM_NONE = 0#
CITFORM_REFMAN = 2#
CITFORM_REFWORKS = 1#
is_configured()[source]#
set_citation_format(citform)[source]#
set_per_page_results(per_page_results)[source]#
class ScholarUtils[source]#

Bases: object

A wrapper for various utensils that come in handy.

LOG_LEVELS = {'debug': 4, 'error': 1, 'info': 3, 'warn': 2}#
static ensure_int(arg, msg=None)[source]#
static log(level, msg)[source]#
class SearchScholarQuery[source]#

Bases: ScholarQuery

This version represents the search query parameters the user can configure on the Scholar website, in the advanced search options.

SCHOLAR_QUERY_URL = 'http://scholar.google.com/scholar?as_q=%(words)s&as_epq=%(phrase)s&as_oq=%(words_some)s&as_eq=%(words_none)s&as_occt=%(scope)s&as_sauthors=%(authors)s&as_publication=%(pub)s&as_ylo=%(ylo)s&as_yhi=%(yhi)s&as_vis=%(citations)s&btnG=&hl=en%(num)s&as_sdt=%(patents)s%%2C5'#
get_url()[source]#

Returns a complete, submittable URL string for this particular query instance. The URL and its arguments will vary depending on the query.

set_author(author)[source]#

Sets names that must be on the result’s author list.

set_include_citations(yesorno)[source]#
set_include_patents(yesorno)[source]#
set_phrase(phrase)[source]#

Sets phrase that must be found in the result exactly.

set_pub(pub)[source]#

Sets the publication in which the result must be found.

set_scope(title_only)[source]#

Sets Boolean indicating whether to search entire article or title only.

set_timeframe(start=None, end=None)[source]#

Sets timeframe (in years as integer) in which result must have appeared. It’s fine to specify just start or end, or both.

set_words(words)[source]#

Sets words that all must be found in the result.

set_words_none(words)[source]#

Sets words of which none must be found in the result.

set_words_some(words)[source]#

Sets words of which at least one must be found in result.

class SoupKitchen[source]#

Bases: object

Factory for creating BeautifulSoup instances.

static make_soup(markup, parser=None)[source]#

Factory method returning a BeautifulSoup instance. The created instance will use a parser of the given name, if supported by the underlying BeautifulSoup instance.

citation_export(querier)[source]#
csv(querier, header=False, sep='|')[source]#
encode(s)[source]#
main()[source]#
txt(querier, with_globals)[source]#

minifold.search module#

class SearchFilter(search_values: list, attributes: list, match: callable)[source]#

Bases: object

The SearchFilter implements a minifold filter allowing to filter minifold entries based on a search predicate.

Constructor.

Parameters:
contains(x: object, y: object) bool[source]#

Tests whether an object is contained in another one (wraps the in operator).

Example

>>> contains("bar", "foobar")
True
Parameters:
  • x (object) – An object.

  • y (object) – An object.

Returns:

True if and only if x in y, False otherwise.

contains_words(word: str, sentence: str, ignore_case: bool = True)[source]#

Checks whether a word is contained in a string (e.g., a sentence).

Examples

>>> contains_words("earth", "Earth is a planet.")
True
>>> contains_words("earth", "A terrible earthquake.")
False
Parameters:
  • word (str) – The searched word.

  • sentence (str) – The queried string.

  • ignore_case (bool) – Pass True if the search is not case sensitive, False otherwise. Defaults to True.

Returns:

True if word has been found in sentence, False otherwise.
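
The difference between matching a word and matching a substring can be illustrated with a minimal sketch. It is not minifold's implementation: contains_words_sketch is a hypothetical name, and relying on regular-expression word boundaries is an assumption about how whole words are detected.

```python
import re


def contains_words_sketch(word: str, sentence: str, ignore_case: bool = True) -> bool:
    """Check whether word occurs in sentence as a whole word."""
    flags = re.IGNORECASE if ignore_case else 0
    # \b matches at a word boundary, so "earth" does not match "earthquake".
    # re.escape() protects characters of word that are special in a regex.
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, sentence, flags) is not None


print(contains_words_sketch("earth", "Earth is a planet."))
print(contains_words_sketch("earth", "A terrible earthquake."))
```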

equals(x: object, y: object) bool[source]#

Tests whether two objects are equal (wraps the == operator).

Examples

>>> equals("foo", "foo")
True
>>> equals("foo", "FOO")
False
Parameters:
  • x (object) – An object.

  • y (object) – An object.

Returns:

True if and only if x == y, False otherwise.

lower_case_contains(x: str, y: str) bool[source]#

Checks whether a string is included in another one without considering the case.

Example

>>> lower_case_contains("foo", "barFOObar")
True
Parameters:
  • x (str) – The search string.

  • y (str) – The queried string.

Returns:

True if and only if x is included in y without considering the case, False otherwise.

lower_case_equals(x: str, y: str) bool[source]#

Checks whether two strings are equal without considering the case.

Example

>>> lower_case_equals("foo", "FOO")
True
Parameters:
  • x (str) – A string.

  • y (str) – A string.

Returns:

True if and only if both strings are equal without considering the case, False otherwise.

search(entries: list, attributes: list, search_values: list, match: callable = <function equals>) list[source]#

Searches the entries for which at least one attribute of interest matches one of the searched values.

Parameters:
Returns:

The subset of entries matched by the search.

minifold.select module#

class SelectConnector(child: Connector, attributes: list)[source]#

Bases: Connector

The SelectConnector class implements the SELECT statement in a minifold pipeline.

Constructor.

Parameters:
  • child (Connector) – The child minifold Connector instance.

  • attributes (list) – The selected keys.

attributes(object: str) set[source]#

Lists the available attributes related to a given collection of minifold entries exposed by this SelectConnector instance.

Parameters:

object (str) – The name of the collection.

Returns:

The set of corresponding attributes.

property child#

Accessor to the child minifold Connector instance.

Returns:

The child minifold Connector instance.

query(query: Query) list[source]#

Handles an input Query instance.

Parameters:

query (Query) – The handled query.

Returns:

The list of entries matching the input query.

select(entries: list, attributes: list) list[source]#

Implements the SELECT statement for a list of minifold entries.

Example

>>> select([{"a": 1, "b": 2, "c": 3}, {"a": 10, "b": 20, "c": 30}], ["a", "b"])
[{'a': 1, 'b': 2}, {'a': 10, 'b': 20}]
Parameters:
  • entries (list) – A list of minifold entries.

  • attributes (list) – The selected keys.

Returns:

The input entries, restricted to the keys of interest.
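
The projection described above can be illustrated with a minimal sketch. It is not minifold's implementation: select_sketch is a hypothetical name, and skipping the selected keys that an entry does not contain is an assumption (the real function may handle missing keys differently).

```python
def select_sketch(entries: list, attributes: list) -> list:
    """Return copies of the entries restricted to the selected keys."""
    return [
        # Assumption: a selected key missing from an entry is skipped.
        {key: entry[key] for key in attributes if key in entry}
        for entry in entries
    ]


entries = [{"a": 1, "b": 2, "c": 3}, {"a": 10, "b": 20, "c": 30}]
print(select_sketch(entries, ["a", "b"]))
```

The sketch builds new dictionaries, so the input entries are left untouched.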

minifold.singleton module#

class Singleton[source]#

Bases: type

The Singleton metaclass allows defining singleton classes, i.e. classes that can be instantiated at most once.

>>> class MyClass(metaclass=Singleton): pass
>>> x = MyClass()
>>> y = MyClass()
>>> x is y
True

Based on this thread.

s_instances = {}#
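
The metaclass mechanism can be illustrated with a minimal sketch. It is not minifold's implementation: SingletonSketch and Settings are hypothetical names, and storing one instance per class in a dictionary named s_instances is an assumption based on the attribute listed above.

```python
class SingletonSketch(type):
    """Metaclass that creates at most one instance per class."""

    # Maps each class using this metaclass to its unique instance.
    s_instances = {}

    def __call__(cls, *args, **kwargs):
        # __call__ runs whenever the class is "called" to build an instance.
        if cls not in SingletonSketch.s_instances:
            SingletonSketch.s_instances[cls] = super().__call__(*args, **kwargs)
        return SingletonSketch.s_instances[cls]


class Settings(dict, metaclass=SingletonSketch):
    """A dictionary shared by every caller."""


first = Settings()
first["verbose"] = True
second = Settings()
print(second is first, second["verbose"])
```

Overriding __call__ in the metaclass intercepts instantiation itself, so the constructor arguments of later calls are ignored once the instance exists.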

minifold.sort_by module#

class SortByConnector(attributes: list, child: Connector, desc: bool = False)[source]#

Bases: Connector

The SortByConnector class implements the SORT BY statement in a minifold pipeline.

Constructor.

Parameters:
  • attributes (list) – The list of entry keys used to sort.

  • child (Connector) – The child minifold Connector instance.

  • desc (bool) – Pass True to sort by descending order, False otherwise.

attributes(object: str) set[source]#

Lists the available attributes related to a given collection of minifold entries exposed by this SortByConnector instance.

Parameters:

object (str) – The name of the collection.

Returns:

The set of corresponding attributes.

property child#

Accessor to the child minifold Connector instance.

Returns:

The child minifold Connector instance.

property desc: bool#

Checks whether this SortByConnector sorts the entries by descending order.

Returns:

True if this SortByConnector sorts the entries by descending order, False otherwise.

query(q: Query) list[source]#

Handles an input Query instance.

Parameters:

q (Query) – The handled query.

Returns:

The list of entries matching the input query.

sort_by(attributes: list, entries: list, desc: bool = False) list[source]#

Sorts a list of minifold entries.

Parameters:
  • attributes (list) – The list of entry keys used to sort.

  • entries (list) – A list of minifold entries.

  • desc (bool) – Pass True to sort by descending order, False otherwise.

Returns:

The sorted entries, with respect to attributes.

sort_by_impl(functor: ValuesFromDictFonctor, entries: list, desc: bool = True) list[source]#

Implementation details of sort_by().

Parameters:
  • functor (ValuesFromDictFonctor) – The functor allowing to extract the values used to sort the entries.

  • entries (list) – A list of minifold entries.

  • desc (bool) – Pass True to sort by descending order, False otherwise.

Returns:

The sorted entries, with respect to functor.
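
Sorting entries on several keys can be illustrated with a minimal sketch. It is not minifold's implementation: sort_by_sketch is a hypothetical name, the tuple-building key function stands in for the functor, and the sketch assumes that every entry contains all the sorting keys and that desc=True means descending order.

```python
def sort_by_sketch(attributes: list, entries: list, desc: bool = False) -> list:
    """Return the entries sorted by the values of the given keys."""
    # The key function plays the role of the functor: it extracts, for each
    # entry, the tuple of values that determines its rank.
    return sorted(
        entries,
        key=lambda entry: tuple(entry[attribute] for attribute in attributes),
        reverse=desc,
    )


entries = [{"a": 2, "b": 1}, {"a": 1, "b": 2}, {"a": 1, "b": 1}]
print(sort_by_sketch(["a", "b"], entries))
print(sort_by_sketch(["a", "b"], entries, desc=True))
```

Tuples compare element by element, so ties on the first attribute are broken by the second one, and so on.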

minifold.strings module#

remove_html_escape_sequences(s: str) str[source]#

Removes the HTML escape sequences.

Example

>>> remove_html_escape_sequences("cha&icirc;ne")
'chane'
Parameters:

s (str) – The input string.

Returns:

The converted string.

remove_html_tags(s: str) str[source]#

Removes the HTML tags from an HTML string.

Example

>>> remove_html_tags("<a href='#'>A link</a> <u><i>italic underlined</i></u>")
'A link italic underlined'
Parameters:

s (str) – The input string.

Returns:

The converted string.
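
Tag removal can be illustrated with a minimal sketch. It is not minifold's implementation: remove_html_tags_sketch is a hypothetical name, and the regular expression is an assumption that is adequate for simple markup only.

```python
import re


def remove_html_tags_sketch(s: str) -> str:
    """Drop everything enclosed between "<" and ">"."""
    # [^>]* stops at the first ">", so each tag is removed separately.
    return re.sub(r"<[^>]*>", "", s)


print(remove_html_tags_sketch("<a href='#'>A link</a> <u><i>italic underlined</i></u>"))
```

A regular expression cannot handle every HTML document (for instance a ">" inside an attribute value); a real HTML parser is preferable for untrusted or complex input.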

remove_latex_escape_sequence(s: str) str[source]#

Removes the LaTeX escape sequences.

Example

>>> remove_latex_escape_sequence("n\oe ud")
'nud'
Parameters:

s (str) – The input string.

Returns:

The converted string.

remove_punctuation(s: str) str[source]#

Replaces the punctuation characters by spaces.

Example

>>> remove_punctuation("Example: a sentence, with punctuation.")
'Example  a sentence  with punctuation '
Parameters:

s (str) – The input string.

Returns:

The converted string.
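
Replacing punctuation by spaces can be illustrated with a minimal sketch. It is not minifold's implementation: remove_punctuation_sketch is a hypothetical name, and restricting the punctuation set to the ASCII characters of string.punctuation is an assumption.

```python
import string

# Translation table mapping each ASCII punctuation character to a space.
PUNCTUATION_TO_SPACE = str.maketrans({char: " " for char in string.punctuation})


def remove_punctuation_sketch(s: str) -> str:
    """Replace every ASCII punctuation character of s by a space."""
    return s.translate(PUNCTUATION_TO_SPACE)


print(repr(remove_punctuation_sketch("Example: a sentence, with punctuation.")))
```

Substituting a space rather than deleting the character keeps the length of the string unchanged and avoids gluing together words separated only by punctuation.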

to_canonic_fullname(s: str) str[source]#

Canonizes a fullname.

Parameters:

s (str) – The input fullname.

Returns:

The converted fullname.

to_canonic_string(s: str) str[source]#

Canonizes a string.

Parameters:

s (str) – The input string.

Returns:

The converted string.

to_international_chr(c: str) str[source]#

Converts international characters to the corresponding character(s) in a-zA-Z.

Examples

>>> to_international_chr("ß")
'ss'
>>> to_international_chr("é")
'e'
Parameters:

c (str) – The input character.

Returns:

The corresponding characters(s).

to_international_string(s: str) str[source]#

Converts a string involving international characters (see to_international_chr()) to the corresponding string.

Example

>>> to_international_string("élève")
'eleve'
Parameters:

s (str) – The input string.

Returns:

The converted string.
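
The conversion can be illustrated with a minimal sketch based on Unicode decomposition. It is not minifold's implementation: the function names and the SPECIAL_CASES table are hypothetical, and the table only lists a few characters that decomposition alone does not handle.

```python
import unicodedata

# Characters without a canonical decomposition need an explicit mapping.
SPECIAL_CASES = {"ß": "ss", "æ": "ae", "œ": "oe"}


def to_international_chr_sketch(c: str) -> str:
    """Map one character to its unaccented counterpart(s)."""
    if c in SPECIAL_CASES:
        return SPECIAL_CASES[c]
    # NFKD splits "é" into "e" followed by a combining acute accent;
    # the combining marks are then dropped.
    decomposed = unicodedata.normalize("NFKD", c)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_international_string_sketch(s: str) -> str:
    """Apply to_international_chr_sketch to every character of s."""
    return "".join(to_international_chr_sketch(c) for c in s)


print(to_international_string_sketch("élève"))
```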

unicode_to_utf8(s: str) str[source]#

Converts a Unicode string to a UTF-8 string.

Parameters:

s (str) – The input unicode string.

Returns:

The corresponding UTF-8 string.

minifold.twitter module#

class TwitterConnector(twitter_id: str, consumer_key: str, consumer_secret: str, access_token: str, access_token_secret: str)[source]#

Bases: Connector

The TwitterConnector is a minifold gateway allowing to fetch tweets from Twitter.

Constructor.

Parameters:
  • twitter_id (str) – The Twitter identifier.

  • consumer_key (str) – The Twitter API Key.

  • consumer_secret (str) – The Twitter secret.

  • access_token (str) – The Twitter access token.

  • access_token_secret – The Twitter access token secret.

attributes(object: str) set[source]#

Lists the available attributes related to a given collection of minifold entries exposed by this TwitterConnector instance.

Parameters:

object (str) – The name of the collection. Valid collections are "self" and "feed".

Returns:

The set of corresponding attributes.

connect()[source]#

Connect to the Twitter API.

query(q: Query) list[source]#

Handles an input Query instance.

Parameters:

q (Query) – The handled query.

Returns:

The list of entries matching the input query.

tweet_to_dict(tweet) dict[source]#

Converts a tweet to a dictionary.

Parameters:

tweet – The input tweet.

Returns:

The corresponding dictionary.

minifold.union module#

class UnionConnector(children: list)[source]#

Bases: Connector

The UnionConnector class implements the UNION statement in a minifold pipeline.

Constructor.

Parameters:

children (list) – The list of child minifold Connector instances.

attributes(object: str) set[source]#

Lists the available attributes related to a given collection of minifold entries exposed by this UnionConnector instance.

Parameters:

object (str) – The name of the collection.

Returns:

The set of corresponding attributes.

query(query: Query) list[source]#

Handles an input Query instance.

Parameters:

query (Query) – The handled query.

Returns:

The list of entries matching the input query.

union(list_entries: list) list[source]#

Makes the union of several lists of entries. To remove duplicates, apply minifold.distinct() to the result.

Parameters:

list_entries (list) – A list of lists of dicts.

Returns:

The resulting list of dicts.
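
As an illustration of the behaviour described above, here is a minimal standalone sketch. It does not call minifold: sketch_union and sketch_distinct are hypothetical names, and the assumption that the union keeps duplicates is inferred only from the pointer to minifold.distinct().

```python
def sketch_union(list_entries: list) -> list:
    """Concatenates several lists of entries, keeping duplicates (assumed)."""
    ret = []
    for entries in list_entries:
        ret.extend(entries)
    return ret


def sketch_distinct(entries: list) -> list:
    """Drops repeated entries, keeping the first occurrence (illustrative)."""
    seen = set()
    ret = []
    for entry in entries:
        # Dicts are unhashable, so compare a hashable view of their items.
        # This only works when every value in the entry is hashable.
        key = frozenset(entry.items())
        if key not in seen:
            seen.add(key)
            ret.append(entry)
    return ret


merged = sketch_union([[{"a": 1}], [{"a": 1}, {"a": 2}]])
print(merged)                   # [{'a': 1}, {'a': 1}, {'a': 2}]
print(sketch_distinct(merged))  # [{'a': 1}, {'a': 2}]
```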

union_gen(list_entries: list) iter[source]#

Iterates over the entries of a list of lists of dicts.

Parameters:

list_entries (list) – A list of lists of dicts.

Returns:

The resulting generator.
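
A lazy variant of the same idea can be sketched with itertools. The name sketch_union_gen is hypothetical, and the assumption that the generator yields individual entries in input order is not confirmed by the text above.

```python
from itertools import chain


def sketch_union_gen(list_entries: list):
    """Yields every entry of every inner list, in order, without copying."""
    yield from chain.from_iterable(list_entries)


gen = sketch_union_gen([[{"a": 1}, {"a": 2}], [{"a": 3}]])
print(next(gen))  # {'a': 1}
print(list(gen))  # [{'a': 2}, {'a': 3}]
```

Because entries are produced on demand, a consumer that stops early never touches the remaining lists.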

minifold.unique module#

class UniqueConnector(attributes: list, child)[source]#

Bases: Connector

The UniqueConnector class implements the UNIQUE statement in a minifold pipeline.

Constructor.

Parameters:
  • attributes (list) – The list of entry keys used to determine the uniqueness.

  • child (Connector) – The child minifold Connector instance.

attributes(object: str) set[source]#

Lists the available attributes related to a given collection of minifold entries exposed by this UniqueConnector instance.

Parameters:

object (str) – The name of the collection.

Returns:

The set of corresponding attributes.

property child#

Accessor to the child minifold Connector instance.

Returns:

The child minifold Connector instance.

query(q: Query) list[source]#

Handles an input Query instance.

Parameters:

q (Query) – The handled query.

Returns:

The list of entries matching the input query.

unique(attributes: list, entries: list) list[source]#

Implements the UNIQUE statement for a list of minifold entries.

Parameters:
  • attributes (list) – The list of entry keys used to determine the uniqueness.

  • entries (list) – A list of minifold entries.

Returns:

The remaining entries once the UNIQUE filtering has been applied.
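
The filtering can be pictured with the following standalone sketch. It is not the library's code: sketch_unique is a hypothetical name, and two behaviours are assumptions, namely that the first entry seen for each combination of values is the one kept, and that a missing key is treated as None.

```python
def sketch_unique(attributes: list, entries: list) -> list:
    """Keeps one entry per distinct combination of values for `attributes`."""
    seen = set()
    ret = []
    for entry in entries:
        # The tuple of values plays the role of the uniqueness key.
        key = tuple(entry.get(attribute) for attribute in attributes)
        if key not in seen:
            seen.add(key)
            ret.append(entry)
    return ret


entries = [
    {"a": 1, "b": 1},
    {"a": 1, "b": 2},
    {"a": 2, "b": 3},
]
print(sketch_unique(["a"], entries))
# [{'a': 1, 'b': 1}, {'a': 2, 'b': 3}]
```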

unique_impl(functor: ValuesFromDictFonctor, entries: list) list[source]#

Implementation details of unique().

Parameters:
  • functor (ValuesFromDictFonctor) – The functor that extracts the values used to determine the uniqueness.

  • entries (list) – A list of minifold entries.

Returns:

A dictionary in which each entry is unique with respect to functor.

minifold.unnest module#

class UnnestConnector(map_key_unnestedkey: dict, child: Connector)[source]#

Bases: Connector

The UnnestConnector class implements the PostgreSQL unnest function in a minifold pipeline.

Constructor.

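Parameters:
  • map_key_unnestedkey (dict) – A dictionary that maps each key to its corresponding unnested key.

  • child (Connector) – The child minifold Connector instance.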

attributes(object: str) set[source]#

Lists the available attributes related to a given collection of minifold entries exposed by this UnnestConnector instance.

Parameters:

object (str) – The name of the collection.

Returns:

The set of corresponding attributes.

property child: Connector#

Accessor to the child minifold Connector instance.

Returns:

The child minifold Connector instance.

query(q: Query) list[source]#

Handles an input Query instance.

Parameters:

q (Query) – The handled query.

Returns:

The list of entries matching the input query.

unnest_key(key: str) str[source]#

Retrieves the unnested key corresponding to a given key.

Parameters:

key (str) – A key identifying a minifold entry attribute whose value is a list.

Raises:

KeyError – if key has no corresponding unnested key.

Returns:

The corresponding unnested key.

unnest(map_key_unnestedkey: dict, entries: list) list[source]#

Implements the PostgreSQL unnest function for a list of minifold entries.

Example

>>> unnest({"a": "A", "b": "B"}, [{"a": [1, 2, 3], "b": [10, 20, 30]}])
[{'A': 1}, {'A': 2}, {'A': 3}, {'B': 10}, {'B': 20}, {'B': 30}]
Parameters:
  • map_key_unnestedkey (dict) – A dictionary that maps each key to its corresponding unnested key.

  • entries (list) – A list of minifold entries.

Returns:

The list of unnested entries. Each entry is a dictionary with a single key-value pair, where the key is an unnested key.
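
To make the transformation concrete, here is a standalone sketch that reproduces the documented example. sketch_unnest is a hypothetical name; skipping keys that are absent from an entry is an assumption, not documented behaviour.

```python
def sketch_unnest(map_key_unnestedkey: dict, entries: list) -> list:
    """Explodes list-valued attributes into single-pair dictionaries."""
    ret = []
    for entry in entries:
        for key, unnested_key in map_key_unnestedkey.items():
            # Assumption: a key missing from the entry contributes nothing.
            for value in entry.get(key, []):
                ret.append({unnested_key: value})
    return ret


print(sketch_unnest(
    {"a": "A", "b": "B"},
    [{"a": [1, 2, 3], "b": [10, 20, 30]}],
))
# [{'A': 1}, {'A': 2}, {'A': 3}, {'B': 10}, {'B': 20}, {'B': 30}]
```

Note that the values of "a" and "b" are not paired with each other: each list is flattened independently, so the link between 1 and 10 in the input entry is lost in the output.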

minifold.values_from_dict module#

class ValuesFromDictFonctor(attributes: list)[source]#

Bases: object

The ValuesFromDictFonctor class is an internal minifold functor used to extract a subset of values from a dictionary.

It is used by several minifold Connector classes, including:

  • GroupByConnector;

  • SortByConnector;

  • UniqueConnector.

Constructor.

Parameters:

attributes (list) – The keys of the values to be extracted.

property attributes: list#

Retrieves the keys of interest related to this ValuesFromDictFonctor instance.

Returns:

The attributes of interest related to this ValuesFromDictFonctor instance.
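
The role of such a functor can be sketched as follows. This is a hypothetical stand-in, not the library class: SketchValuesFromDict is an invented name, and both the call interface (entry in, tuple of values out) and the use of None for missing keys are assumptions, since only the attributes property is documented here.

```python
class SketchValuesFromDict:
    """Callable that extracts a tuple of values from a dict (illustrative)."""

    def __init__(self, attributes: list):
        self._attributes = attributes

    @property
    def attributes(self) -> list:
        return self._attributes

    def __call__(self, entry: dict) -> tuple:
        # Assumption: a missing key yields None rather than raising.
        return tuple(entry.get(attribute) for attribute in self._attributes)


functor = SketchValuesFromDict(["a", "b"])
print(functor({"a": 1, "b": 2, "c": 3}))  # (1, 2)

# A tuple-returning callable can serve directly as a sort key.
entries = [{"a": 2, "b": 1}, {"a": 1, "b": 2}, {"a": 1, "b": 1}]
print(sorted(entries, key=functor))
# [{'a': 1, 'b': 1}, {'a': 1, 'b': 2}, {'a': 2, 'b': 1}]
```

Returning a tuple keeps the result hashable and ordered, which is what grouping, sorting and uniqueness checks all need.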

minifold.where module#

class WhereConnector(child: Connector, keep_if: callable)[source]#

Bases: Connector

The WhereConnector class implements the WHERE statement in a minifold pipeline.

Constructor.

Parameters:
  • child (Connector) – The child minifold Connector instance.

  • keep_if (callable) – A function such that f(entry) returns True if entry must be kept, False otherwise.

attributes(object: str) set[source]#

Lists the available attributes related to a given collection of minifold entries exposed by this WhereConnector instance.

Parameters:

object (str) – The name of the collection.

Returns:

The set of corresponding attributes.

property child#

Accessor to the child minifold Connector instance.

Returns:

The child minifold Connector instance.

property keep_if#

Accessor to the filtering function used by this WhereConnector instance.

Returns:

The filtering function used by this WhereConnector instance.

query(q: Query) list[source]#

Handles an input Query instance.

Parameters:

q (Query) – The handled query.

Returns:

The list of entries matching the input query.

where(entries: list, f: callable) list[source]#

Implements the WHERE statement for a list of minifold entries.

Example

>>> where([{"a": 1, "b": 1}, {"a": 2, "b": 2}, {"a": 3, "b": 3}], lambda e: e["a"] <= 2)
[{'a': 1, 'b': 1}, {'a': 2, 'b': 2}]
Parameters:
  • entries (list) – A list of minifold entries.

  • f (callable) – A function such that f(entry) returns True if entry must be kept, False otherwise.

Returns:

The kept entries.
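
Since f is any callable, several conditions can be combined before filtering. The sketch below is standalone and illustrative: sketch_where mirrors the documented behaviour with a list comprehension, and the helper all_of is a hypothetical addition that is not part of minifold.

```python
from typing import Callable


def sketch_where(entries: list, f: Callable[[dict], bool]) -> list:
    """Keeps the entries for which f(entry) is true."""
    return [entry for entry in entries if f(entry)]


def all_of(*predicates: Callable[[dict], bool]) -> Callable[[dict], bool]:
    """Builds a predicate that holds only when every given predicate holds."""
    return lambda entry: all(p(entry) for p in predicates)


entries = [{"a": 1, "b": 1}, {"a": 2, "b": 2}, {"a": 3, "b": 3}]
keep = all_of(lambda e: e["a"] <= 2, lambda e: e["b"] >= 2)
print(sketch_where(entries, keep))  # [{'a': 2, 'b': 2}]
```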

Module contents#

Top-level package.