paperfetcher package
Submodules
paperfetcher.apiclients module
Client implementations to communicate with various APIs.
- class paperfetcher.apiclients.COCIQuery(components={}, query_params={})
Bases:
paperfetcher.apiclients.Query
Class for structuring and executing COCI REST API queries.
- Parameters
components (collections.OrderedDict) – Components to append to the base URL.
query_params (collections.OrderedDict) – Ordered dictionary of query parameters.
- components
Components to append to the base URL.
- Type
collections.OrderedDict
- query_base
Base URL for query (https://opencitations.net/index/coci/api/v{}/…).
- Type
str
- query_params
Dictionary of query parameters.
- Type
collections.OrderedDict
- headers
Dictionary of HTTP headers.
- Type
dict
- response
Response received after executing a GET query to the COCI API.
- Type
requests.Response
Examples
Querying the references of a paper with a known DOI:
>>> from collections import OrderedDict
>>> from paperfetcher.apiclients import COCIQuery
>>> query = COCIQuery(components=OrderedDict([("references", "10.1021/acs.jpcb.1c02191")]))
>>> query()
>>> query.response
<Response [200]>
>>> query.response.json()
[{'oci': '020010002013610122837192512113701120002010901-0200100000236191212370201090809', 'creation': '2021-05-12', 'timespan': 'P9Y5M19D', ...}]
Querying the citations of a paper with a known DOI:
>>> query = COCIQuery(components=OrderedDict([("citations", "10.1021/acs.jpcb.1c02191")]))
>>> query()
>>> query.response
<Response [200]>
>>> query.response.json()
[{'oci': '0200100000236252421370200020100050206-020010002013610122837192512113701120002010901', 'creation': '2021-10-02', 'timespan': 'P4M20D', 'journal_sc': 'no', 'author_sc': 'no', 'citing': '10.1002/pol.20210526', 'cited': '10.1021/acs.jpcb.1c02191'}]
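The JSON returned by a COCI query is a list of citation records, so a common next step is to pull the DOIs out of it. The following is a minimal sketch, not part of paperfetcher: the helper name `collect_dois` is hypothetical, and the sample record simply copies the fields shown in the citations example for this class.

```python
from typing import Iterable, Set

# One record shaped like the COCI citations response shown in the example.
SAMPLE_RECORDS = [
    {
        "oci": "0200100000236252421370200020100050206-020010002013610122837192512113701120002010901",
        "creation": "2021-10-02",
        "timespan": "P4M20D",
        "journal_sc": "no",
        "author_sc": "no",
        "citing": "10.1002/pol.20210526",
        "cited": "10.1021/acs.jpcb.1c02191",
    }
]


def collect_dois(records: Iterable[dict], field: str) -> Set[str]:
    """Collect the non-empty values of `field` ("citing" or "cited").

    Hypothetical helper: returns a set so that duplicate DOIs collapse.
    """
    return {record[field] for record in records if record.get(field)}


citing = collect_dois(SAMPLE_RECORDS, "citing")
cited = collect_dois(SAMPLE_RECORDS, "cited")
```

With a live query, `query.response.json()` would take the place of `SAMPLE_RECORDS`.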
- class paperfetcher.apiclients.CrossrefQuery(components={}, query_params={})
Bases:
paperfetcher.apiclients.Query
Class for structuring and executing Crossref REST API queries.
Query components can be added to the base URL by passing an ordered dictionary to the components argument. For example, the components dictionary {"comp1-key": "comp1-value", "comp2": None} changes the query URL to https://api.crossref.org/comp1-key/comp1-value/comp2/.
- Parameters
components (collections.OrderedDict) – Components to append to the base URL.
query_params (collections.OrderedDict) – Ordered dictionary of query parameters.
- components
Components to append to the base URL.
- Type
collections.OrderedDict
- query_base
Base URL for query (https://api.crossref.org/…).
- Type
str
- query_params
Dictionary of query parameters.
- Type
collections.OrderedDict
- headers
Dictionary of HTTP headers.
- Type
dict
- response
Response received after executing a GET query to the Crossref API.
- Type
requests.Response
Examples
Querying the metadata of a paper with a known DOI:
>>> from paperfetcher.apiclients import CrossrefQuery
>>> query = CrossrefQuery(components={"works": "10.1021/acs.jpcb.1c02191"})
>>> query()
>>> query.response
<Response [200]>
>>> query.response.json()
{'status': 'ok', 'message-type': 'work', 'message-version': '1.0.0', 'message': {...}}
Query to fetch all articles from a journal with a known ISSN:
>>> from collections import OrderedDict
>>> components = OrderedDict([("journals", "1520-5126"), ("works", None)])
>>> query = CrossrefQuery(components)
>>> query()
>>> query.response
<Response [200]>
>>> query.response.json()
{'status': 'ok', 'message-type': 'work', 'message-version': '1.0.0', 'message': {...}}
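The way components extend the base URL can be illustrated with a short standalone sketch. It reproduces the URL shape described for this class (each key followed by its value, with `None` values skipped), but the function `build_query_url` is hypothetical and is not claimed to match paperfetcher's internal implementation.

```python
from collections import OrderedDict
from urllib.parse import urlencode


def build_query_url(base_url, components=None, query_params=None):
    """Append ordered components, then query parameters, to a base URL.

    Hypothetical helper written for illustration only.
    """
    url = base_url if base_url.endswith("/") else base_url + "/"
    for key, value in (components or {}).items():
        url += f"{key}/"
        if value is not None:  # a None value contributes only the key
            url += f"{value}/"
    if query_params:
        url += "?" + urlencode(query_params)
    return url


doc_example = build_query_url(
    "https://api.crossref.org/",
    OrderedDict([("comp1-key", "comp1-value"), ("comp2", None)]),
)
journal_works = build_query_url(
    "https://api.crossref.org",
    OrderedDict([("journals", "1520-5126"), ("works", None)]),
    {"rows": 5},
)
```

Order matters here, which is why an ordered dictionary is used: `journals/1520-5126/works/` and `works/journals/1520-5126/` are different routes.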
- class paperfetcher.apiclients.Query(base_url=None, query_params: dict = {}, headers: str = {})
Bases:
object
Base class for structuring and executing HTTP GET queries.
- Parameters
base_url (str) – Base URL for query (such as api.xyz.com/get).
query_params (dict) – Dictionary of query parameters.
headers (dict) – Dictionary of HTTP headers to pass along with the query.
- query_base
Base URL for query (such as api.xyz.com/get).
- Type
str
- query_params
Dictionary of query parameters.
- Type
dict
- headers
Dictionary of HTTP headers.
- Type
dict
- response
Response received after executing the GET query.
- Type
requests.Response
Examples
A simple Query to the Github REST API:
>>> from paperfetcher.apiclients import Query
>>> query = Query("https://api.github.com")
>>> query()
>>> query.response
<Response [200]>
A Query to the Github REST API to fetch a list of all public repositories in the paperfetcher organization:
>>> query = Query("https://api.github.com/orgs/paperfetcher/repos",
...               query_params={"type": "public"},
...               headers={"Accept": "application/vnd.github.v3+json"})
>>> query()
>>> query.response
<Response [200]>
- property response
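The examples for this class follow a configure, call, then read `.response` pattern. The sketch below illustrates that pattern with the standard library only. `SketchQuery` and its injectable `fetch` argument are hypothetical names introduced here so the example can run without network access; they are not part of paperfetcher, which stores a `requests.Response` instead of the tuple used here.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlencode
from urllib.request import Request, urlopen


class SketchQuery:
    """Hypothetical stand-in: configure, call, then read .response."""

    def __init__(self, base_url: str, query_params: Optional[Dict] = None,
                 headers: Optional[Dict] = None,
                 fetch: Optional[Callable] = None):
        self.query_base = base_url
        # Copy the inputs so instances never share a mutable default.
        self.query_params = dict(query_params or {})
        self.headers = dict(headers or {})
        self._fetch = fetch or self._http_get
        self.response = None

    @staticmethod
    def _http_get(url, headers):
        # Real GET request; only used when no fetch function is injected.
        with urlopen(Request(url, headers=headers)) as resp:
            return resp.status, resp.read()

    def url(self) -> str:
        if not self.query_params:
            return self.query_base
        return self.query_base + "?" + urlencode(self.query_params)

    def __call__(self):
        # Executing the query stores the result on the object.
        self.response = self._fetch(self.url(), self.headers)


calls = []


def fake_fetch(url, headers):
    """Offline replacement for the HTTP call; records what was requested."""
    calls.append((url, headers))
    return 200, b"[]"


query = SketchQuery("https://api.github.com/orgs/paperfetcher/repos",
                    query_params={"type": "public"},
                    headers={"Accept": "application/vnd.github.v3+json"},
                    fetch=fake_fetch)
query()
```

Storing the response on the object, rather than returning it, is what lets the documented examples inspect `query.response` after the call.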
paperfetcher.datastructures module
Custom data structures for paperfetcher.
- class paperfetcher.datastructures.CitationsDataset(field_names: tuple, items: list = [])
Bases:
paperfetcher.datastructures.Dataset
Stores a tabular dataset of citations, with multiple custom fields.
CitationsDatasets can be exported to pandas DataFrames, and loaded from or saved to disk in text, CSV, or Excel file formats.
- Parameters
field_names (tuple) – Names (str) of fields.
items (list) – List of citations to store (default=[]). Each citation should be an iterable of length len(field_names).
Examples
To create a CitationsDataset object to store the DOI, URL, article title, authors, and date of issue for each citation:
>>> from paperfetcher.datastructures import CitationsDataset
>>> field_names = ["DOI", "URL", "title", "author", "issued"]
>>> data = [["10.xxyy/0.0.0.000001", "https://dx.doi.org/10.xxyy/0.0.0.000001", "A study of A", "P, Q and R", "2020-02-20"],
...         ["10.xxyy/0.0.0.000002", "https://dx.doi.org/10.xxyy/0.0.0.000002", "An investigation into B", "Q, R and P", "2020-03-20"],
...         ["10.xxyy/0.0.0.000003", "https://dx.doi.org/10.xxyy/0.0.0.000003", "The causes of C in D", "R, P and Q", "2020-04-20"],
...         ["10.xxyy/0.0.0.000004", "https://dx.doi.org/10.xxyy/0.0.0.000004", "Characterizing the role of E on F", "P and Q", "2020-05-20"]]
>>> ds = CitationsDataset(field_names, data)
To add a citation to the CitationsDataset:
>>> ds.append(["10.xxyy/0.0.0.000005", "https://dx.doi.org/10.xxyy/0.0.0.000005", "The G effect", "Q and P", "2020-06-20"])
To export the CitationsDataset object to a pandas DataFrame:
>>> df = ds.to_df()
>>> df
                    DOI                                      URL                              title      author      issued
0  10.xxyy/0.0.0.000001  https://dx.doi.org/10.xxyy/0.0.0.000001                       A study of A  P, Q and R  2020-02-20
1  10.xxyy/0.0.0.000002  https://dx.doi.org/10.xxyy/0.0.0.000002            An investigation into B  Q, R and P  2020-03-20
2  10.xxyy/0.0.0.000003  https://dx.doi.org/10.xxyy/0.0.0.000003               The causes of C in D  R, P and Q  2020-04-20
3  10.xxyy/0.0.0.000004  https://dx.doi.org/10.xxyy/0.0.0.000004  Characterizing the role of E on F     P and Q  2020-05-20
4  10.xxyy/0.0.0.000005  https://dx.doi.org/10.xxyy/0.0.0.000005                       The G effect     Q and P  2020-06-20
To save data to disk:
>>> ds.save_txt("cits.txt")
>>> ds.save_csv("cits.csv")
>>> ds.save_excel("cits.xlsx")
- append(item)
Adds a citation to the dataset.
- extend(items)
Adds each citation from a list of citations (i.e. each inner list of a nested list) to the dataset.
- save_csv(file)
Saves dataset to .csv file. The first row of the CSV file contains field names.
- save_excel(file)
Saves dataset to Excel file (uses Pandas). The first row of the Excel file contains field names.
- save_txt(file)
Saves dataset to .txt file.
- to_df()
Converts dataset to DataFrame.
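The two constraints stated for this class (each citation has exactly len(field_names) values, and the first CSV row holds the field names) can be sketched with the standard library. `SketchCitations` is a hypothetical illustration, not the paperfetcher implementation; in particular, raising `ValueError` on a length mismatch is an assumption made for this sketch.

```python
import csv
import io


class SketchCitations:
    """Hypothetical minimal tabular dataset with named fields."""

    def __init__(self, field_names, items=None):
        self.field_names = tuple(field_names)
        self._rows = []
        self.extend(items or [])

    def append(self, item):
        row = list(item)
        # Every citation must supply one value per field.
        if len(row) != len(self.field_names):
            raise ValueError(
                f"expected {len(self.field_names)} values, got {len(row)}")
        self._rows.append(row)

    def extend(self, items):
        for item in items:
            self.append(item)

    def to_csv_string(self):
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(self.field_names)  # header row comes first
        writer.writerows(self._rows)
        return buffer.getvalue()


ds = SketchCitations(["DOI", "title"],
                     [["10.xxyy/0.0.0.000001", "A study of A"]])
ds.append(["10.xxyy/0.0.0.000002", "An investigation into B"])
csv_text = ds.to_csv_string()
```

Routing `extend` through `append` keeps the length check in a single place.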
- class paperfetcher.datastructures.DOIDataset(items: list = [])
Bases:
paperfetcher.datastructures.Dataset
Stores a dataset of DOIs.
DOIDatasets can be exported to pandas DataFrames, and loaded from or saved to disk in text, CSV, or Excel file formats.
- Parameters
items (list) – List of DOIs (str) to store (default=[]).
Examples
To create a DOIDataset object from a list of DOIs:
>>> ds = DOIDataset(["x1.y1.z1/123123", "x2.y2.z2/456456"])
To add a DOI to the DOIDataset object:
>>> ds.append("x3.y3.z3/789789")
To export the DOIDataset object to a pandas DataFrame:
>>> df = ds.to_df()
>>> df
               DOI
0  x1.y1.z1/123123
1  x2.y2.z2/456456
2  x3.y3.z3/789789
To save data to disk:
>>> ds.save_txt("dois.txt")
>>> ds.save_csv("dois.csv")
>>> ds.save_excel("dois.xlsx")
- extend_dataset(ds: paperfetcher.datastructures.DOIDataset)
Appends all items from DOIDataset ds to the end of the current dataset.
- save_csv(file)
Saves dataset to .csv file.
- save_excel(file)
Saves dataset to Excel file.
- save_txt(file)
Saves dataset to .txt file.
- to_df()
Converts dataset to DataFrame.
- to_txt_string()
Returns a string which can be written to a .txt file.
- class paperfetcher.datastructures.Dataset(items: list = [])
Bases:
object
Abstract interface that defines functions for child Dataset classes to implement.
Datasets are designed to store [usually tabular] data (as input to or output from paperfetcher searches), export data to pandas DataFrames, and load/save data to disk using common data formats (txt, csv, xlsx).
- Parameters
items (iterable) – Items to store in dataset (default=[]).
- append(item)
Adds an item to the dataset.
- extend(items: list)
Adds each item from a list of items to the dataset.
- classmethod from_csv(file)
Loads dataset from .csv file.
- classmethod from_excel(file)
Loads dataset from Excel file.
- classmethod from_txt(file)
Loads dataset from .txt file.
- save_csv(file)
Saves dataset to .csv file.
- save_excel(file)
Saves dataset to Excel file.
- save_txt(file)
Saves dataset to .txt file.
- to_df()
Converts dataset to DataFrame.
- class paperfetcher.datastructures.HeadlessRISWriter(*, mapping: Optional[Dict] = None, list_tags: Optional[List[str]] = None, ignore: Optional[List[str]] = None, skip_unknown_tags: bool = False, enforce_list_tags: bool = True)
Bases:
rispy.writer.BaseWriter
- DEFAULT_LIST_TAGS: List[str] = ['A1', 'A2', 'A3', 'A4', 'AU', 'KW', 'N1']
- DEFAULT_MAPPING: Dict = {'A1': 'first_authors', 'A2': 'secondary_authors', 'A3': 'tertiary_authors', 'A4': 'subsidiary_authors', 'AB': 'abstract', 'AD': 'author_address', 'AN': 'accession_number', 'AU': 'authors', 'C1': 'custom1', 'C2': 'custom2', 'C3': 'custom3', 'C4': 'custom4', 'C5': 'custom5', 'C6': 'custom6', 'C7': 'custom7', 'C8': 'custom8', 'CA': 'caption', 'CN': 'call_number', 'CY': 'place_published', 'DA': 'date', 'DB': 'name_of_database', 'DO': 'doi', 'DP': 'database_provider', 'EP': 'end_page', 'ER': 'end_of_reference', 'ET': 'edition', 'ID': 'id', 'IS': 'number', 'J2': 'alternate_title1', 'JA': 'alternate_title2', 'JF': 'alternate_title3', 'JO': 'journal_name', 'KW': 'keywords', 'L1': 'file_attachments1', 'L2': 'file_attachments2', 'L4': 'figure', 'LA': 'language', 'LB': 'label', 'M1': 'note', 'M3': 'type_of_work', 'N1': 'notes', 'N2': 'notes_abstract', 'NV': 'number_of_volumes', 'OP': 'original_publication', 'PB': 'publisher', 'PY': 'year', 'RI': 'reviewed_item', 'RN': 'research_notes', 'RP': 'reprint_edition', 'SE': 'section', 'SN': 'issn', 'SP': 'start_page', 'ST': 'short_title', 'T1': 'primary_title', 'T2': 'secondary_title', 'T3': 'tertiary_title', 'TA': 'translated_author', 'TI': 'title', 'TT': 'translated_title', 'TY': 'type_of_reference', 'UK': 'unknown_tag', 'UR': 'url', 'VL': 'volume', 'Y1': 'publication_year', 'Y2': 'access_date'}
- PATTERN: str = '{tag} - {value}'
- START_TAG: str = 'TY'
- class paperfetcher.datastructures.RISDataset(items: list = [])
Bases:
paperfetcher.datastructures.Dataset
Stores a dataset of RIS items. RIS items are rispy-readable dictionaries.
An RISDataset can be created from an RIS-formatted string, or from an RIS file. An RISDataset can be written to an RIS-formatted string, or to an RIS file. Individual item dictionaries in an RISDataset can be modified to add new tags or change the values of existing tags.
- Parameters
items (list) – List of citations to store (default=[]). Each citation should be a rispy-readable dictionary.
Examples
To create an RISDataset from a list of rispy-readable dictionaries (see rispy doc on GitHub for details):
>>> dict_list = [{'journal_name': ...}, {'journal_name': ...}, ...]
>>> ds = RISDataset(dict_list)
To load an RISDataset from an RIS-formatted string:
>>> ds = RISDataset.from_ris_string(ris_string)
To load an RISDataset from an RIS file:
>>> ds = RISDataset.from_ris(ris_file)
- extend_dataset(ds: paperfetcher.datastructures.RISDataset)
Appends all items from RISDataset ds to the end of the current dataset.
- classmethod from_ris(file)
Loads dataset from RIS file.
- classmethod from_ris_string(ris_string)
Loads dataset from RIS-formatted string.
- save_ris(filename, headers=False)
Saves dataset to .ris file.
- Parameters
filename – Path to file to write RIS data to.
headers (bool, default=False) – If set to true, writes reference number before each RIS entry.
- to_ris_string(headers=False)
Returns a string which can be written to .ris file.
- Parameters
headers (bool, default=False) – If set to true, writes reference number before each RIS entry.
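To make the relationship between a rispy-readable dictionary and RIS text concrete, here is a minimal standalone sketch. It uses a small subset of the tag mapping listed for HeadlessRISWriter, inverted from tag to key. The function `sketch_to_ris_string` is hypothetical, the default reference type is an assumption, and the two spaces before the hyphen follow the common RIS convention rather than the PATTERN value as it is printed in this listing.

```python
# Subset of the documented DEFAULT_MAPPING, inverted to key -> tag.
KEY_TO_TAG = {
    "type_of_reference": "TY",
    "title": "TI",
    "authors": "AU",
    "year": "PY",
    "doi": "DO",
}
LIST_TAGS = {"AU"}  # list-valued tags: one output line per element


def sketch_to_ris_string(entries):
    """Serialize dictionaries to RIS text (hypothetical, simplified)."""
    lines = []
    for entry in entries:
        # TY opens each reference; "JOUR" as a fallback is an assumption.
        lines.append(f"TY  - {entry.get('type_of_reference', 'JOUR')}")
        for key, value in entry.items():
            tag = KEY_TO_TAG.get(key)
            if tag is None or tag == "TY":
                continue  # skip unknown keys and the already-written TY
            values = value if tag in LIST_TAGS else [value]
            lines.extend(f"{tag}  - {v}" for v in values)
        lines.append("ER  - ")  # ER closes each reference
        lines.append("")
    return "\n".join(lines)


ris_text = sketch_to_ris_string([{
    "type_of_reference": "JOUR",
    "title": "A study of A",
    "authors": ["P", "Q"],
    "year": "2020",
    "doi": "10.xxyy/0.0.0.000001",
}])
```

In practice, `to_ris_string` and `save_ris` should be preferred over hand-written serialization; the sketch only shows the shape of the output.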
paperfetcher.exceptions module
Definitions of all Exceptions raised by paperfetcher.
- exception paperfetcher.exceptions.ContentNegotiationError
Bases:
Exception
Exception raised when content negotiation fails.
- exception paperfetcher.exceptions.DatasetError
Bases:
Exception
Exception raised when an incorrect operation is performed on a dataset.
- exception paperfetcher.exceptions.QueryError
Bases:
Exception
Exception raised when query fails.
- exception paperfetcher.exceptions.RISParsingError
Bases:
Exception
Exception raised when RIS parsing fails.
- exception paperfetcher.exceptions.SearchError
Bases:
Exception
Exception raised when search fails.
paperfetcher.handsearch module
Classes to fetch all journal works (articles) matching a set of keywords and within a given date range by querying various APIs.
- class paperfetcher.handsearch.CrossrefSearch(ISSN='', type='journal-article', keyword_list=None, from_date=None, until_date=None, batch_size=20, sort_order='desc')
Bases:
object
Retrieves all works from a journal, given its ISSN, which match a set of keywords and are within a date range.
Calling a CrossrefSearch object performs the search. A search object can be called with the arguments display_progress_bar (True/False; default=True) to toggle the display of a search progress bar, select (True/False; default=False), and select_fields (list) to query only a subset of metadata for each journal article.
If select is False, a full (memory and time intensive) search is performed, fetching all metadata associated with each journal work.
If select is True, a subset of fields to fetch can be specified using the select_fields parameter. Check the Crossref REST API doc for details on which field names are permissible.
Note
Performing a search with no keywords and select=False can be very time- and memory-intensive. The search object will complain when such a search is performed.
- Parameters
ISSN (str) – Journal (web) ISSN.
type (str) – Type of works to fetch (default="journal-article").
keyword_list (list) – List of keywords (str) to query with (default=None).
from_date (str) – Fetch articles published on or after this date (format="YYYY-MM-DD", default=None).
until_date (str) – Fetch articles published until this date (format="YYYY-MM-DD", default=None).
batch_size (int) – Number of works to fetch in each batch (default=20).
sort_order (str) – Order in which to sort works by date ("asc" or "desc", default="desc").
- ISSN
Journal (web) ISSN.
- Type
str
- type
Type of works to fetch (default="journal-article").
- Type
str
- keyword_list
List of keywords (str) to query with.
- Type
list
- from_date
Fetch articles published on or after this date (format="YYYY-MM-DD").
- Type
str
- until_date
Fetch articles published until this date (format="YYYY-MM-DD").
- Type
str
- batch_size
Number of works to fetch in each batch (default=20).
- Type
int
- sort_order
Order in which to sort works by date ("asc" or "desc", default="desc").
- Type
str
- results
List of dictionaries, each dictionary corresponds to a work.
- Type
list
Examples
>>> from paperfetcher.handsearch import CrossrefSearch
>>> search = CrossrefSearch(ISSN="1520-5126", keyword_list=["hydration"], from_date="2018-01-01", until_date="2020-01-01")
>>> search()
>>> len(search)
13
>>> ds = search.get_DOIDataset()
>>> ds.to_df()
                     DOI
0   10.1021/jacs.9b09103
1   10.1021/jacs.9b06862
2   10.1021/jacs.9b09111
3   10.1021/jacs.9b05874
4   10.1021/jacs.9b02820
5   10.1021/jacs.9b05136
6   10.1021/jacs.9b02742
7   10.1021/jacs.9b00577
8   10.1021/jacs.8b11448
9   10.1021/jacs.8b12877
10  10.1021/jacs.8b11667
11  10.1021/jacs.8b08298
12  10.1021/jacs.7b11537
- dry_run(select=False, select_fields=[])
Reports how many works this search would fetch, without fetching them.
- get_CitationsDataset(field_list=[], field_parsers_list=[])
Parses a selection of fields from search results and returns them as a CitationsDataset object.
- Parameters
field_list (list) – Names of fields to parse (see Crossref REST API doc for permissible field name values).
field_parsers_list (list) – List of field parser functions corresponding to each field name. A None value means that no parser is needed for that field.
- Returns
CitationsDataset
Example
>>> from paperfetcher import handsearch, parsers
>>> search = handsearch.CrossrefSearch(ISSN="1520-5126", keyword_list=["hydration"], from_date="2018-01-01",
...                                    until_date="2020-01-01")
>>> search(select=True, select_fields=['DOI', 'URL', 'title', 'author', 'issued'])
>>> ds = search.get_CitationsDataset(field_list=['DOI', 'URL', 'title', 'author', 'issued'],
...                                  field_parsers_list=[None, None, parsers.crossref_title_parser,
...                                                      parsers.crossref_authors_parser, parsers.crossref_date_parser])
- get_DOIDataset()
Extracts DOIs from search results and returns them as a DOIDataset object.
- Returns
DOIDataset
- get_RISDataset(extra_field_list=[], extra_field_parser_list=[], extra_field_rispy_tags=[])
Extracts DOIs from search results and fetches RIS data for each DOI using Crossref’s content negotiation service.
Extra fields in the search results that are not automatically populated by Crossref’s content negotiation service can be mapped to the RIS format (through rispy’s mapping) using the extra_field_list, extra_field_parser_list, and extra_field_rispy_tags arguments.
- Parameters
extra_field_list (list) – List of extra fields to parse and include in RIS file (see Crossref REST API doc for permissible field name values).
extra_field_parser_list (list) – List of field parser functions corresponding to each extra field name. A None value means that no parser is needed for that field.
extra_field_rispy_tags (list) – List of rispy tags for each extra field.
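The constructor arguments of CrossrefSearch correspond naturally to query parameters of the Crossref REST API (`filter`, `query`, `rows`, `offset`, `sort`, `order`, `select`). The sketch below shows one plausible mapping. The function `build_crossref_params` is hypothetical, and the exact parameters paperfetcher sends (for example, sorting on `published`) are assumptions, not something this reference states.

```python
def build_crossref_params(work_type="journal-article", keyword_list=None,
                          from_date=None, until_date=None, batch_size=20,
                          sort_order="desc", select_fields=None, offset=0):
    """Translate search settings into Crossref query parameters.

    Hypothetical helper; the mapping is an assumption for illustration.
    """
    filters = [f"type:{work_type}"]
    if from_date:
        filters.append(f"from-pub-date:{from_date}")
    if until_date:
        filters.append(f"until-pub-date:{until_date}")

    params = {
        "filter": ",".join(filters),
        "sort": "published",   # assumed sort field
        "order": sort_order,
        "rows": batch_size,    # number of works per batch
        "offset": offset,      # starting position of this batch
    }
    if keyword_list:
        params["query"] = " ".join(keyword_list)
    if select_fields:
        # Restricts the metadata returned for each work.
        params["select"] = ",".join(select_fields)
    return params


params = build_crossref_params(keyword_list=["hydration"],
                               from_date="2018-01-01",
                               until_date="2020-01-01")
second_batch = build_crossref_params(offset=20, select_fields=["DOI", "title"])
```

Fetching in batches then amounts to repeating the request with `offset` advanced by `batch_size` until no more works are returned.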
paperfetcher.parsers module
Functions to parse data returned from queries.
New parsers can be added here.
- paperfetcher.parsers.crossref_authors_parser(author_array)
Function to parse authors.
- Returns
str
- paperfetcher.parsers.crossref_date_parser(date)
Function to parse date.
- Returns
str
- paperfetcher.parsers.crossref_title_parser(title)
Function to parse title.
- Returns
str
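Crossref returns titles as a list of strings, authors as a list of objects with `given` and `family` keys, and dates as a `date-parts` array. A new parser therefore only has to turn one such value into a string. The functions below are hypothetical sketches of that idea, not the library's own parsers; the output formats are chosen to resemble the CitationsDataset example ("P, Q and R", "2020-02-20") and are assumptions.

```python
def sketch_title_parser(title):
    """Return the first title string, or an empty string if none."""
    return title[0] if title else ""


def sketch_authors_parser(author_array):
    """Join author names as 'A, B and C'."""
    names = []
    for author in author_array:
        parts = [author.get("given"), author.get("family")]
        name = " ".join(part for part in parts if part)
        if name:
            names.append(name)
    if len(names) <= 1:
        return "".join(names)
    return ", ".join(names[:-1]) + " and " + names[-1]


def sketch_date_parser(date):
    """Format the first date-parts entry as YYYY, YYYY-MM, or YYYY-MM-DD."""
    parts = date["date-parts"][0]
    return "-".join(
        f"{part:04d}" if index == 0 else f"{part:02d}"
        for index, part in enumerate(parts)
    )


authors = sketch_authors_parser([
    {"given": "P", "family": "A"},
    {"given": "Q", "family": "B"},
    {"given": "R", "family": "C"},
])
issued = sketch_date_parser({"date-parts": [[2020, 2, 20]]})
title = sketch_title_parser(["A study of A"])
```

A parser written this way can be passed in a field_parsers_list alongside the built-in ones.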
paperfetcher.snowballsearch module
Classes to fetch all journal articles in the references of (i.e. backward search) or citing (i.e. forward search) a set of journal articles.
For backward search, you can use either Crossref or COCI; the two are expected to give equivalent results. For forward search, only COCI is currently supported.
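The difference between backward and forward search can be sketched on a small in-memory citation graph. Everything in this example is hypothetical: the DOIs are placeholders, and `references_of` and `citations_of` stand in for the API lookups that the search classes perform.

```python
# Placeholder graph: each key cites the DOIs in its list.
REFERENCES = {
    "10.1000/a": ["10.1000/c", "10.1000/d"],
    "10.1000/b": ["10.1000/d"],
}


def references_of(doi):
    """Backward lookup: works that `doi` cites."""
    return REFERENCES.get(doi, [])


def citations_of(doi):
    """Forward lookup: works that cite `doi`."""
    return [citing for citing, refs in REFERENCES.items() if doi in refs]


def snowball(search_dois, lookup):
    """Union the lookup results for every DOI in the search list."""
    result_dois = set()
    for doi in search_dois:
        result_dois.update(lookup(doi))
    return result_dois


backward = snowball(["10.1000/a", "10.1000/b"], references_of)
forward = snowball(["10.1000/d"], citations_of)
```

Collecting results in a set means a work referenced by several search DOIs appears only once, which matches the set type documented for result_dois.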
- class paperfetcher.snowballsearch.COCIBackwardReferenceSearch(search_dois: list)
Bases:
object
Retrieves the (DOIs of) all articles in the references of a list of (DOIs of) articles by using the COCI REST API.
- Parameters
search_dois (list) – List of DOIs (str) to fetch references of.
- search_dois
List of DOIs (str) to fetch references of.
- Type
list
- result_dois
Set of DOIs which are referenced by the DOIs in search_dois.
- Type
set
Example
>>> from paperfetcher import snowballsearch
>>> search = snowballsearch.COCIBackwardReferenceSearch(["10.1021/acs.jpcb.1c02191", "10.1073/pnas.2018234118"])
>>> search()
>>> len(search)
140
>>> search.result_dois
{'10.1021/jp972543+', '10.1073/pnas.0708088105', ... , '10.1073/pnas.0705830104'}
- classmethod from_DOIDataset(search_dataset: paperfetcher.datastructures.DOIDataset)
Constructs a search object from a DOIDataset.
- Parameters
search_dataset (DOIDataset) – Dataset of DOIs to fetch references of.
- Returns
COCIBackwardReferenceSearch
- get_DOIDataset()
Returns search results as a DOIDataset object.
- Returns
DOIDataset
- get_RISDataset()
Returns search results as an RISDataset object. Uses the Crossref REST API for content negotiation.
- Returns
RISDataset
- class paperfetcher.snowballsearch.COCIForwardCitationSearch(search_dois: list)
Bases:
object
Retrieves the (DOIs of) all articles citing a list of (DOIs of) articles by using the COCI REST API.
- Parameters
search_dois (list) – List of DOIs (str) to fetch citations of.
- search_dois
List of DOIs (str) to fetch citations of.
- Type
list
- result_dois
Set of DOIs which cite the DOIs in search_dois.
- Type
set
Example
>>> from paperfetcher import snowballsearch
>>> search = snowballsearch.COCIForwardCitationSearch(["10.1021/acs.jpcb.8b11423", "10.1073/pnas.2018234118"])
>>> search()
>>> len(search)
11
>>> search.result_dois
{'10.1039/c9sc02097g', '10.1021/acs.jpcb.1c05748', ... , '10.1021/acs.jpclett.9b02052'}
- classmethod from_DOIDataset(search_dataset: paperfetcher.datastructures.DOIDataset)
Constructs a search object from a DOIDataset.
- Parameters
search_dataset (DOIDataset) – Dataset of DOIs to fetch citations of.
- Returns
COCIForwardCitationSearch
- get_DOIDataset()
Returns search results as a DOIDataset object.
- Returns
DOIDataset
- get_RISDataset()
Returns search results as an RISDataset object. Uses the Crossref REST API for content negotiation.
- Returns
RISDataset
- class paperfetcher.snowballsearch.CrossrefBackwardReferenceSearch(search_dois: list)
Bases:
object
Retrieves (the DOIs of) all articles in the references of a list of (DOIs of) articles by using the Crossref REST API.
- Parameters
search_dois (list) – List of DOIs (str) to fetch references of.
- search_dois
List of DOIs (str) to fetch references of.
- Type
list
- result_dois
Set of DOIs which are referenced by the DOIs in search_dois.
- Type
set
Example
>>> from paperfetcher import snowballsearch
>>> search = snowballsearch.CrossrefBackwardReferenceSearch(["10.1021/acs.jpcb.1c02191", "10.1073/pnas.2018234118"])
>>> search()
>>> len(search)
140
>>> search.result_dois
{'10.1021/jp972543+', '10.1073/pnas.0708088105', ... , '10.1073/pnas.0705830104'}
- classmethod from_DOIDataset(search_dataset: paperfetcher.datastructures.DOIDataset)
Constructs a search object from a DOIDataset.
- Parameters
search_dataset (DOIDataset) – Dataset of DOIs to fetch references of.
- Returns
CrossrefBackwardReferenceSearch
- get_DOIDataset()
Returns search results as a DOIDataset object.
- Returns
DOIDataset
- get_RISDataset()
Returns search results as an RISDataset object. Uses the Crossref REST API for content negotiation.
- Returns
RISDataset