svgdigitizer.pdf
Module for handling operations related to PDF files.
- class svgdigitizer.pdf.Pdf(pdf_filepath, doi=None)
Handles all interactions with the PDF file.
- property bibliographic_entry
Get the citation for the DOI of the provided PDF file. Returns a BibTeX string.
EXAMPLES:
>>> import os
>>> from svgdigitizer.pdf import Pdf
>>> from svgdigitizer.test.cli import TemporaryData
>>> with TemporaryData("**/Hermann_2018_J._Electrochem._Soc._165_J3192.pdf") as directory:
...     # do not assign the instance to a variable; that keeps the file open and fails on Windows
...     Pdf(os.path.join(directory, "Hermann_2018_J._Electrochem._Soc._165_J3192.pdf")).bibliographic_entry
'@article{Hermann_2018, title={Enhanced Electrocatalytic Oxidation ... year={2018}, pages={J3192–J3198} }'
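The description above does not say how the BibTeX string is obtained from the DOI. One common way to resolve a DOI to BibTeX is DOI content negotiation, sketched below. This is an assumption for illustration, not necessarily what svgdigitizer does; `build_bibtex_request` is a hypothetical helper, and the request is only prepared, not sent.

```python
from urllib.request import Request


def build_bibtex_request(doi):
    """Prepare (but do not send) a request asking doi.org for a BibTeX record.

    Hypothetical helper: it illustrates DOI content negotiation and is not
    part of svgdigitizer.
    """
    return Request(
        f"https://doi.org/{doi}",
        # The Accept header selects the representation returned for the DOI.
        headers={"Accept": "application/x-bibtex"},
    )


request = build_bibtex_request("10.1149/2.0221810jes")
```

Passing `request` to `urllib.request.urlopen` would perform the actual lookup, which requires network access.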
- static build_identifier(citation, skip_words=('a', 'ab', 'aboard', 'about', 'above', 'across', 'after', 'against', 'al', 'along', 'amid', 'among', 'an', 'and', 'anti', 'around', 'as', 'at', 'before', 'behind', 'below', 'beneath', 'beside', 'besides', 'between', 'beyond', 'but', 'by', 'd', 'da', 'das', 'de', 'del', 'dell', 'dello', 'dei', 'degli', 'della', 'dell', 'delle', 'dem', 'den', 'der', 'des', 'despite', 'die', 'do', 'down', 'du', 'during', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'el', 'en', 'et', 'except', 'for', 'from', 'gli', 'i', 'il', 'in', 'inside', 'into', 'is', 'l', 'la', 'las', 'le', 'les', 'like', 'lo', 'los', 'near', 'nor', 'of', 'off', 'on', 'onto', 'or', 'over', 'past', 'per', 'plus', 'round', 'save', 'since', 'so', 'some', 'sur', 'than', 'the', 'through', 'to', 'toward', 'towards', 'un', 'una', 'unas', 'under', 'underneath', 'une', 'unlike', 'uno', 'unos', 'until', 'up', 'upon', 'versus', 'via', 'von', 'while', 'with', 'within', 'without', 'yet', 'zu', 'zum'))
Build the entry identifier from a BibTeX citation provided as a pybtex BibliographyData object. By default, some common title words are omitted from the identifier (see skip_words). Pass a custom word list as skip_words to change which words are omitted.
EXAMPLES:
A complex title:
>>> from svgdigitizer.pdf import Pdf
>>> from pybtex.database import parse_string
>>> bibtex_string = '@article{Mar_Ol__2021, title={Surfaces are made by the devil: fooo,2 of XX(110)/other stuff.}, volume={145}, ISSN={0015-0057}, url={http://dx.doi.org/10.1016/j.what.2015.123456}, DOI={10.1016/j.what.2015.123456}, journal={My Journal}, publisher={Publisher}, author={Marí-Olé, J. and Foo, B.}, year={2013}, month=dec, pages={4567} }' #pylint: disable=line-too-long
>>> bibliography_data = parse_string(bibtex_string, bib_format="bibtex")
>>> Pdf.build_identifier(bibliography_data)
'mari-ole_2013_surfaces_4567'
Special characters are skipped when selecting the first word:
>>> bibtex_string = '@article{hermannEffectPHAnion2021, title = {The {{Effect}} of {{pH}} and {{Anion Adsorption}} on {{Formic Acid Oxidation}} on {{Au}}(111) {{Electrodes}}}, author = {Hermann, Johannes M. and Abdelrahman, Areeg and Jacob, Timo and Kibler, Ludwig A.}, year = 2021, month = jul, journal = {Electrochim. Acta}, volume = {385}, pages = {138279}, issn = {0013-4686}, doi = {10.1016/j.electacta.2021.138279} }' #pylint: disable=line-too-long
>>> bibliography_data = parse_string(bibtex_string, bib_format="bibtex")
>>> Pdf.build_identifier(bibliography_data)
'hermann_2021_effect_138279'
Keep only the first meaningful word of the title:
>>> bibtex_string = '@article{Hermann_2018, title={An in the foo bar article}, volume={165}, ISSN={0013-4651}, url={http://dx.doi.org/10.1149/2.0221810jes}, DOI={10.1149/2.0221810jes}, journal={Journal of The Electrochemical Society}, publisher={The Electrochemical Society}, author={Hermann, Johannes M. and Jacob, Timo and Kibler, Ludwig A.}, year={2018}, month=aug, pages={J3192–J3198} }' #pylint: disable=line-too-long
>>> bibliography_data = parse_string(bibtex_string, bib_format="bibtex")
>>> Pdf.build_identifier(bibliography_data)
'hermann_2018_foo_j3192'
A hyphenated first word is kept whole, including its dash:
>>> bibtex_string = '@article{Hermann_2018, title={An-in the foo bar article}, volume={165}, ISSN={0013-4651}, url={http://dx.doi.org/10.1149/2.0221810jes}, DOI={10.1149/2.0221810jes}, journal={Journal of The Electrochemical Society}, publisher={The Electrochemical Society}, author={Hermann, Johannes M. and Jacob, Timo and Kibler, Ludwig A.}, year={2018}, month=aug, pages={J3192–J3198} }' #pylint: disable=line-too-long
>>> bibliography_data = parse_string(bibtex_string, bib_format="bibtex")
>>> Pdf.build_identifier(bibliography_data)
'hermann_2018_an-in_j3192'
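The four examples for build_identifier suggest a pattern: first author's last name, year, first title word that is not skipped, and first page, all lowercased and joined with underscores. The following standard-library sketch reproduces that pattern on plain strings. It is an illustration inferred from the example outputs, not svgdigitizer's implementation; `sketch_identifier`, `ascii_fold`, and the shortened `SKIP_WORDS` tuple are hypothetical names.

```python
import re
import unicodedata

# Hypothetical, shortened skip list; the documented default is much longer.
SKIP_WORDS = ("a", "an", "and", "in", "of", "on", "the")


def ascii_fold(text):
    """Drop accents by decomposing characters and discarding non-ASCII parts."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def sketch_identifier(last_name, year, title, pages, skip_words=SKIP_WORDS):
    """Join last name, year, first meaningful title word, and first page."""
    # Remove everything except word characters, whitespace, and dashes,
    # so braces, colons, and parentheses do not end up in the word.
    cleaned = re.sub(r"[^\w\s-]", "", ascii_fold(title)).lower()
    # First word that is not in the skip list ("" if the title has none).
    first_word = next((w for w in cleaned.split() if w not in skip_words), "")
    # Page ranges may use a hyphen or an en dash; keep only the first page.
    first_page = re.split("[-\u2013]", pages)[0].strip().lower()
    return "_".join([ascii_fold(last_name).lower(), str(year), first_word, first_page])
```

Called with the fields of the third example, `sketch_identifier("Hermann", 2018, "An in the foo bar article", "J3192\u2013J3198")` yields `'hermann_2018_foo_j3192'`. The real method takes a pybtex BibliographyData object rather than separate strings.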
- property doc
Holds the opened PDF file.
- property doi
Extract the DOI from the provided PDF or from the provided string. Because additional pages are sometimes prepended to a PDF, the DOI is taken from either the first or the second page.
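The first-or-second-page lookup can be pictured with a small sketch that scans already-extracted page texts. The regular expression and the names `DOI_PATTERN` and `find_doi` are assumptions made for illustration; the pattern svgdigitizer actually uses is not documented here.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a suffix
# that runs until whitespace, a quote, or an angle bracket.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def find_doi(page_texts, max_pages=2):
    """Return the first DOI found in the first `max_pages` page texts, else None."""
    for text in page_texts[:max_pages]:
        match = DOI_PATTERN.search(text)
        if match:
            # Trailing punctuation usually belongs to the sentence, not the DOI.
            return match.group(0).rstrip(".,;")
    return None


pages = [
    "Cover sheet prepended by the repository",
    "J. Electrochem. Soc. 165, J3192 (2018). doi: 10.1149/2.0221810jes.",
]
doi = find_doi(pages)
```

Here the first page is a prepended cover sheet without a DOI, so the match comes from the second page.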
- export_png(page_idx, dpi)
Return the requested PDF page as a PNG image rendered at the specified DPI.
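When choosing a value for dpi, it helps to know the resulting image size. PDF page dimensions are given in points, with 72 points per inch, so the pixel size scales linearly with the DPI. The helper below, `pixel_size`, is hypothetical and only shows the arithmetic; the exact rounding applied by the renderer may differ.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def pixel_size(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


# A US Letter page measures 612 x 792 points.
letter_at_300 = pixel_size(612, 792, 300)
```

At 300 DPI a US Letter page comes out at 2550 x 3300 pixels, and doubling the DPI quadruples the pixel count.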
- property num_pages
Number of pages of the PDF file.
- rename_by_key()
Rename the provided PDF file to the key derived from its citation.
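As a rough picture of the renaming step, the sketch below renames a file to a given key while keeping it in the same directory. `rename_to_key` is a hypothetical helper, and keeping the `.pdf` suffix and refusing to overwrite an existing file are assumptions; rename_by_key may behave differently.

```python
import tempfile
from pathlib import Path


def rename_to_key(pdf_path, key):
    """Rename the file to `<key><suffix>` in the same directory; return the new path."""
    source = Path(pdf_path)
    target = source.with_name(key + source.suffix)
    if target.exists():
        # Assumed policy: never silently overwrite another file.
        raise FileExistsError(target)
    return source.rename(target)


# Demonstration on a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as directory:
    original = Path(directory) / "download.pdf"
    original.write_bytes(b"%PDF-1.4\n")
    renamed = rename_to_key(original, "hermann_2018_foo_j3192")
    renamed_name = renamed.name
    original_gone = not original.exists()
```

The key used here is the identifier from the build_identifier examples; in the real class it is derived from the citation of the PDF itself.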