Articles in the Python category


Writing an EPUB parser. Part 1

Parsing Epubs

Recently I've become frustrated with the experience of reading books on my Kindle Paperwhite. The swipe features really bother me. I really like MoonReader on Android, but reading on my phone isn't always pleasant. This led me to look into other hardware. I've been eyeing the BOOX company for a while, and I'm definitely considering some of their new offerings at some point. Until I can afford to splurge on a new ebook reader, I've decided to start a new project: making my own ebook reader tools!

I'm starting with EPUBs, as this is one of the easiest formats to work with. At its core, an EPUB is a zip file with the .epub extension instead of .zip, with many individual XHTML chapter files inside it. You can read more about how they're structured over at FILEFORMAT.
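To make the "it's just a zip file" point concrete, here is a minimal sketch using only the standard library's zipfile module. The archive is built in memory so the example runs without a real book; the file names (OEBPS/chapter1.xhtml and so on) and the helper names are made up for illustration, and the result is not a complete, valid EPUB.

```python
import io
import zipfile


def build_minimal_epub():
    """Build a tiny, hypothetical EPUB-like archive in memory (not a valid book)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # EPUB containers start with an uncompressed "mimetype" entry.
        archive.writestr(
            "mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED
        )
        # Placeholder contents; a real book has proper XML in these files.
        archive.writestr("META-INF/container.xml", "<container/>")
        archive.writestr("OEBPS/chapter1.xhtml", "<html><body>One</body></html>")
        archive.writestr("OEBPS/chapter2.xhtml", "<html><body>Two</body></html>")
    return buffer.getvalue()


def list_xhtml_chapters(epub_bytes):
    """Return the names of the XHTML files inside the archive, in stored order."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        return [name for name in archive.namelist() if name.endswith(".xhtml")]


if __name__ == "__main__":
    for name in list_xhtml_chapters(build_minimal_epub()):
        print(name)
```

With a real file on disk, passing its path straight to zipfile.ZipFile("TOG.epub") and calling namelist() should show the same kind of listing.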

The tool I've chosen for reading EPUBs is the Python library ebooklib. It seemed to be a nice lightweight library for the job. I also used DearPyGUI for showing this on the screen, because I figured why not, I like GUI libraries.

My first task was to find an EPUB file, so I downloaded one from my calibre server. I convert all my ebook files to .epub and .mobi on my calibre server so I can access them anywhere I can read my OPDS feed. I chose Throne of Glass (abbreviating to TOG.epub for the rest of the post).

Loading

I launched Python, and ran

>>> from ebooklib import epub
>>> print(book := epub.read_epub("TOG.epub"))

This returned an <ebooklib.epub.EpubBook object...>. Seeing I had an EpubBook, I ran dir(book) and found the properties available to me:

['add_author', 'add_item', 'add_metadata', 'add_prefix',
 'bindings', 'direction', 'get_item_with_href', 'get_item_with_id',
 'get_items', 'get_items_of_media_type', 'get_items_of_type',
 'get_metadata', 'get_template', 'guide',
 'items', 'language', 'metadata', 'namespaces', 'pages', 'prefixes',
 'reset', 'set_cover', 'set_direction', 'set_identifier', 'set_language',
 'set_template', 'set_title', 'set_unique_metadata', 'spine',
 'templates', 'title', 'toc', 'uid', 'version']
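The list above has no underscore-prefixed names in it, so it was presumably trimmed from the raw dir() output. A small sketch of one way to do that trimming follows; public_names and StandInBook are hypothetical names for this example, and the stand-in class only exists so the snippet runs without ebooklib installed.

```python
def public_names(obj):
    """Return the attribute names from dir(obj) that don't start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


class StandInBook:
    """Hypothetical stand-in for an EpubBook, used only to demonstrate the filter."""

    def __init__(self):
        self.title = "TOG"
        self._internal = None  # filtered out because of the leading underscore

    def get_items(self):
        return []


if __name__ == "__main__":
    print(public_names(StandInBook()))
```

Calling public_names(book) on the real EpubBook should give a list shaped like the one above.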

Of note, the get_item_with_X entries caught my eye, as well as spine. For my file, book.spine looked like it gave me a bunch of tuples of an ID and a "yes" string, and I had no idea what that string was. I then noticed I had a toc property. Assuming that was a Table of Contents, I printed that out and saw a bunch of epub.Link objects. This looked like something I could use.
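Here is a rough sketch of how the spine tuples and toc entries could be walked. It rests on two assumptions that are worth checking against your own file: that the "yes" string is the spine's linear attribute from the EPUB's OPF file (where "no" marks supplementary content), and that a toc entry is either a link-like object with title and href attributes or a (section, children) pair for nested chapters. StandInLink, flatten_toc and linear_spine_ids are hypothetical names, and the stand-in class replaces epub.Link so the snippet runs without ebooklib.

```python
from dataclasses import dataclass


@dataclass
class StandInLink:
    """Hypothetical stand-in for epub.Link; only href and title are used here."""

    href: str
    title: str


def flatten_toc(toc):
    """Yield (title, href) pairs, descending into (section, children) entries."""
    for entry in toc:
        if isinstance(entry, (tuple, list)):
            section, children = entry
            yield (section.title, getattr(section, "href", None))
            yield from flatten_toc(children)
        else:
            yield (entry.title, entry.href)


def linear_spine_ids(spine):
    """Return the IDs from (idref, linear) tuples whose linear flag is not "no"."""
    return [idref for idref, linear in spine if linear != "no"]


if __name__ == "__main__":
    toc = [
        StandInLink("ch1.xhtml", "Chapter 1"),
        (StandInLink("part2.xhtml", "Part 2"), [StandInLink("ch2.xhtml", "Chapter 2")]),
    ]
    for title, href in flatten_toc(toc):
        print(title, href)
    print(linear_spine_ids([("cover", "no"), ("ch1", "yes")]))
```

With a real book, the same functions could be tried as flatten_toc(book.toc) and linear_spine_ids(book.spine), and each ID from the spine could then be passed to book.get_item_with_id from the list above.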

I will note, at this time I was thinking that this wasn't …



Tyrel's Blog

Code, Flying, Tech, Automation