Tyrel's Blog

Code, Flying, Tech, Automation

Oct 16, 2022

New Blog - Pelican!

If you have read the previous post and then looked at this one, you'll notice a LOT of changes. I was recently exploited and had heysrv.php files everywhere, so I have decided to forgo WordPress for now. I am now using Pelican!

It's very sleek, and it only took me a few hours to port my WordPress export to Pelican's reStructuredText format.

All I have to do is run invoke publish and it will be on the server. No PHP, no database. All files properly in their right places.

It comes with your standard blogging experience: Categories, Tags, RSS/Atom feeds, etc. You need to set up Disqus — which I probably won't — in order to get comments though.

I'm pleased with it. I have posts go under YYYY/MM/slug.html files, which I like for organization. Posting images is easy: I just toss them under content/images/YYYY/MM/, organized by date.
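As a rough sketch (not necessarily the exact config used for this blog), a YYYY/MM/slug.html layout can be expressed in pelicanconf.py with Pelican's ARTICLE_URL and ARTICLE_SAVE_AS settings. The pattern, the sample date, and the sample slug below are assumptions for illustration.

```python
from datetime import datetime

# Pelican settings (these would live in pelicanconf.py). The pattern is an
# assumption chosen to match the YYYY/MM/slug.html layout described above.
ARTICLE_URL = "{date:%Y}/{date:%m}/{slug}.html"
ARTICLE_SAVE_AS = "{date:%Y}/{date:%m}/{slug}.html"

# The patterns use str.format-style fields, so the resulting path can be
# previewed by hand with a hypothetical post date and slug.
example_path = ARTICLE_SAVE_AS.format(
    date=datetime(2022, 10, 16), slug="new-blog-pelican"
)
print(example_path)
```

Keeping both settings identical means the link Pelican generates and the file it writes always agree.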

 · · ·  python  pelican

Jun 01, 2022

Writing an EPUB parser. Part 1

Parsing Epubs

Recently I've become frustrated with the experience of reading books on my Kindle Paperwhite. The swipe features really bother me. I really like MoonReader on Android, but reading on my phone isn't always pleasing. This led me to look into other hardware. I've been eyeing BOOX for a while, and I'm definitely considering some of their new offerings at some point. Until I can afford to splurge on a new ebook reader, I've decided to start a new project: making my own ebook reader tools!

I'm starting with EPUBs, as this is one of the easiest formats to work with. At its core, an EPUB is a zip file with the .epub extension instead of .zip, with many individual XHTML chapter files inside it. You can read more about how they're structured over at FILEFORMAT.
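To make the "it's just a zip file" point concrete, here is a minimal sketch using only the standard library. The helper names and the toy archive are made up for illustration; the OEBPS folder name is a common convention, not a requirement.

```python
import io
import zipfile


def list_epub_members(source) -> list:
    """Return the names of the files stored inside an EPUB (a zip archive).

    `source` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def build_toy_epub() -> io.BytesIO:
    """Build a tiny in-memory archive shaped like an EPUB, for illustration only."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("META-INF/container.xml", "<container/>")
        archive.writestr("OEBPS/chapter1.xhtml", "<html><body><p>Hi</p></body></html>")
    buffer.seek(0)
    return buffer


names = list_epub_members(build_toy_epub())
print(names)
```

Pointing list_epub_members at a real file path, such as an .epub on disk, should list its XHTML chapters alongside the metadata files.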

The tool I've chosen for reading EPUBs is the Python library ebooklib. This seemed to be a nice lightweight library for reading EPUBs. I also used DearPyGUI for showing this to the screen, because I figured why not, I like GUI libraries.

My first task was to find an EPUB file, so I downloaded one from my calibre server. I convert all my ebook files to .epub and .mobi on my calibre server so I can access them anywhere I can read my OPDS feed. I chose Throne of Glass (abbreviated to TOG.epub for the rest of this post).

Loading

I launched Python and ran

>>> from ebooklib import epub
>>> print(book := epub.read_epub("TOG.epub"))

This returned an <ebooklib.epub.EpubBook object...>. Seeing I had an EpubBook, I ran dir(book) and found the properties available to me:

['add_author', 'add_item', 'add_metadata', 'add_prefix',
 'bindings', 'direction', 'get_item_with_href', 'get_item_with_id',
 'get_items', 'get_items_of_media_type', 'get_items_of_type',
 'get_metadata', 'get_template', 'guide',
 'items', 'language', 'metadata', 'namespaces', 'pages', 'prefixes',
 'reset', 'set_cover', 'set_direction', 'set_identifier', 'set_language',
 'set_template', 'set_title', 'set_unique_metadata', 'spine',
 'templates', 'title', 'toc', 'uid', 'version']

Of note, the get_item_with_X entries caught my eye, as well as spine. For my file, book.spine gave me a bunch of tuples of an ID and a "yes" string, and I had no idea what the latter meant. I then noticed I had a toc property. Assuming that was a Table of Contents, I printed it out and saw a bunch of epub.Link objects. This looked like something I could use.

I will note that at this point I was thinking this wasn't the direction I wanted to take the project. I really wanted to learn how to parse these things myself (unzip, parse XML or HTML, etc.), but I realized I needed to see someone else's work to even know what was going on. With this "defeat for the evening" admitted, I figured "hey, why not at least make SOMETHING, right?" I decided to carry on.

Seeing I was on at least some track, I opened up PyCharm and made a new Project. First I set up a class called Epub, made a couple of functions for setting things up, and ended up with

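import ebooklib
from ebooklib import epub
from typing import List

class Epub:
    def __init__(self, book_path: str) -> None:
        # Parse the EPUB once and keep the pieces the reader needs.
        self.contents: ebooklib.epub.EpubBook = epub.read_epub(book_path)
        self.title: str = self.contents.title
        self.toc: List[ebooklib.epub.Link] = self.contents.toc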

I then set up a parse_chapters function, where I loop through the TOC. Here I went to the definition of Link and saw I was able to get an href and a title, so I decided my object for chapters would be a dictionary (I'll move to a dataclass later) with title and content. I remembered from earlier that I had a get_item_with_href, so I stored the text from the TOC's href: self.contents.get_item_with_href(link.href).get_content(). This would later prove to be a bad decision when I opened "The Fold.epub" and realized that a TOC could have a tuple of Section and Link, not just Links. I ended up storing the item itself, and doing a double loop in the parse_chapters function to handle the case where an entry is a tuple.

def parse_chapters(self) -> None:
    idx = 0
    for _item in self.toc:
        if isinstance(_item, tuple):  # In case is section tuple(section, [link, ...])
            for link in _item[1]:
                self._parse_link(idx, link)
                idx += 1
        else:
            self._parse_link(idx, _item)
            idx += 1
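The double loop above handles one level of sections. As a sketch only, and assuming a TOC can nest sections inside sections, a small recursive generator would flatten any depth. FakeLink and FakeSection are hypothetical stand-ins for ebooklib's Link and Section so the example runs without the library.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class FakeLink:
    """Stand-in for ebooklib's epub.Link (only the fields used here)."""
    href: str
    title: str


@dataclass
class FakeSection:
    """Stand-in for ebooklib's epub.Section."""
    title: str


def iter_links(toc) -> Iterator[FakeLink]:
    """Yield every link in a TOC, descending into (section, children) tuples."""
    for entry in toc:
        if isinstance(entry, tuple):
            _section, children = entry
            yield from iter_links(children)
        else:
            yield entry


toc = [
    FakeLink("title.xhtml", "Title Page"),
    (FakeSection("Part One"), [
        FakeLink("ch1.xhtml", "Chapter 1"),
        (FakeSection("Interlude"), [FakeLink("int.xhtml", "Interlude")]),
    ]),
]
titles = [link.title for link in iter_links(toc)]
print(titles)
```

With a generator like this, the index could come from enumerate(iter_links(toc)) instead of a manually incremented counter.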

_parse_link simply makes that dictionary of title and item I mentioned earlier, with a new index as I introduced buttons in the DearPyGUI at this time as well.

def _parse_link(self, idx, link) -> None:
    title = link.title
    self.chapters.append(dict(
        index=idx,
        title=title,
        item=self.contents.get_item_with_href(link.href)
    ))
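The dictionary could later become a dataclass, as mentioned earlier. Here is a minimal sketch of what that might look like; the Chapter name and the sample values are assumptions, not code from the project.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Chapter:
    index: int   # position used for the chapter buttons
    title: str   # title taken from the TOC link
    item: Any    # would hold the item returned by get_item_with_href


chapters: List[Chapter] = []
# Hypothetical entry; a real one would carry the ebooklib item, not None.
chapters.append(Chapter(index=0, title="Dedication", item=None))
print(chapters[0].title)
```

Attribute access (chapter.title) replaces the string keys, so a typo in a field name fails immediately instead of raising a KeyError somewhere later.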

That's really all there is to making an MVP of an EPUB parser. You can use BeautifulSoup to parse the HTML from the get_body_content() calls on items to make more readable text if you want, but depending on your front end, the HTML may be what you want.

In my implementation my Epub class keeps track of the currently selected chapter, so this loads from all chapters and sets the current_text variable.

from bs4 import BeautifulSoup

def load_view(self) -> None:
    item = self.chapters[self.current_index]['item']
    soup = BeautifulSoup(item.get_body_content(), "html.parser")
    # Keep only the paragraph text, one paragraph per line.
    text = [para.get_text() for para in soup.find_all("p")]
    self.current_text = "\n".join(text)

I don't believe any of this code will be useful to anyone outside of my research for now, but it's my first step into writing an EPUB parser myself.

The DearPyGUI steps are out of scope of this blog post, but here is my final ebook Reader which is super inefficient!

final ebook reader, chapters on left, text on right

I figure the Dedication page is not as copyrighted as the rest of the book, so it's fair play showing that much. Sarah J. Maas, if you have any issues, I can find another book for my screenshots.

 · · ·  epub  python

Nov 05, 2021

Finished my GitHub CLI tool

I never intended this to be a fully fledged CLI tool comparable to the likes of the real GitHub CLI. This was simply a way to refresh myself and have fun. I have accomplished this, and am now calling this "Feature Complete". You can play around with it yourself from the repository on GitLab.

TESTING

With that accomplished, I then added pytest-cov to my requirements.in and was able to leverage some coverage checks. I was at about 30% coverage with the latest changes (much higher than anticipated!) so I knew what I wanted to focus on next. The API seemed the easiest to test first again, so I changed how I loaded my fixtures: the loader now takes a name and opens that file. In real code I would not have the function in both my test files, I would refactor it, but again, this is just a refresher, I'm lazy.

I decided earlier that I also wanted to catch HTTP 403 errors as I ran into a rate limit issue. Which, I assure you dear reader, was a thousand percent intentional so I would know what happens. Yeah, we'll go with that.

pytest has a context manager called pytest.raises, so I was able to just use with pytest.raises(httpx.HTTPStatusError) and check that raise really easily.

The next bits of testing for the API were around the pagination. I faked two responses and needed to update my link header, checking the cases where there was NO link, where there were multiple pages, and my shortcut return for when the response was an object, not a list. Pretty straightforward.
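For context on what those pagination cases exercise: GitHub returns a Link header with rel="next" and rel="last" entries. The following is a small sketch of parsing it, not the tool's actual code; the function name and the example URLs are assumptions.

```python
import re
from typing import Dict, Optional

# Matches one `<url>; rel="name"` entry of a Link header.
_LINK_PART = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def parse_link_header(header: Optional[str]) -> Dict[str, str]:
    """Map each rel value (e.g. "next", "last") to its URL.

    Returns an empty dict when the header is missing or empty.
    """
    if not header:
        return {}
    return {rel: url for url, rel in _LINK_PART.findall(header)}


header = (
    '<https://api.github.com/user/repos?page=2>; rel="next", '
    '<https://api.github.com/user/repos?page=5>; rel="last"'
)
links = parse_link_header(header)
print(links.get("next"))
print(parse_link_header(None))
```

The two calls at the bottom mirror two of the test cases: a multi-page response and a response with no Link header at all.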

The GHub file tests were kind of annoying, I'm leveraging rich.table.Table so I haven't been able to find a nice "this will make a string for you" without just using rich's print function. I decided the easiest check was to see if the Table.Columns.Cells matched what I wanted, which felt a little off but it's fine.

The way I generated the table is by making a generator in a pretty ugly way and having a bunch of repo['column'], repo['column'] responses, rather than doing a dict comprehension and narrowing the keys down. If I ever come back to this, I MIGHT reassess that with a {k: v for k, v in repo.items() if k in SELECTED_KEYS} and then yield a dictionary, but it's not worth the effort.
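For what it's worth, the dict-comprehension version would look roughly like this. It is a sketch only: SELECTED_KEYS, narrow_repos, and the sample repo are made up for illustration, and the real column names in the tool may differ.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical selection of columns to keep.
SELECTED_KEYS = ("name", "stargazers_count", "forks_count", "description")


def narrow_repos(repos: Iterable[dict]) -> Iterator[Dict[str, object]]:
    """Yield one dict per repo, keeping only the keys in SELECTED_KEYS."""
    for repo in repos:
        yield {k: v for k, v in repo.items() if k in SELECTED_KEYS}


sample = [{
    "name": "ghub",
    "stargazers_count": 1,
    "forks_count": 0,
    "description": "refresher project",
    "private": False,
}]
rows = list(narrow_repos(sample))
print(rows)
```

Note that the comprehension iterates over repo.items() for each repo; iterating over the list of repos directly would try to unpack whole dictionaries.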

Overall I'd say this project was fun. It gave me a glimpse back into the Python world, and an excuse to write a couple blog posts. My next project is to get a Django site up and running again, so I can figure out how to debug my django-dbfilestorage.

Closing Thoughts

If I had to do this again, I would probably have tried some test-driven development. I've tried in the past, but I don't work on a lot of greenfield projects. I tend to be the kind of engineer who jumps HEAD FIRST into code, and then tests are an afterthought.

I also kind of want to rewrite this in Go and Rust, two other languages I've been fond of lately, just to see how they'd compare in fun. I haven't done any API calls with Rust yet, only made a little Roguelike by following Herbert Wolverson's Hands-On-Rust book. The Tidelift CLI is all Go and a bazillion API calls (okay, like ten), so it wouldn't be too hard to use something like SPF13's Cobra CLI library and make a quick tool that way.

One fun thing I learned while moving things over to GitLab is that my user Tyrel is a super early adopter. I was in the first 36,000 people! I showed a screenshot of my user ID to my friend Sunanda at GitLab and we had fun finding that out.

 · · ·  python  cli

Nov 04, 2021

Python3 GitHub CLI tool as a refresher

It's no lie that I love terminals. I wish I could live on a terminal and never really need to see a GUI application again.

Last night I migrated a lot of my old code from one GitLab account to another (tyrelsouza to tyrel) in an effort to clean up some of my usernames spread across the world. While doing that I noticed my django-dbfilestorage Python module, which has been sitting and rotting for three years. I played around a little bit in order to port it to Python 3.9, but I ran into some base64 errors. I tried a little bit, but it was late and I couldn't figure it out. My conclusion is that I have been away from Python for too long, so the little things that I knew and loved had fallen away. I mentioned this to my friend Alex and he said "make a barebones github cli (readonly?) with issue viewer, and stats display". I've embarked on a journey to refresh my Python well enough to repair DBFS.

Me: "okay python frioends, what should I make as a quick refresher into the Python world?" alex: "maybe: barebonx github cli (reasdonly?) with issue viewer and stats display"

I knew I wanted to use httpx as my network client library, it's new, fast, and I have a couple friends who work on it. I started with a barebones requirements.in file, tossed in invoke, pytest, and black. From there I used pip-compile to generate my requirements.txt - (a tip I picked up recently while adding Pip-Compile support to the Tidelift CLI) and I was good to go.

The docs for the GitHub API are pretty easy to read, so I knew all I really needed to do was set my Accept header to be Version3 and I could view the schema. With the schema saved to a .json file I then wrote a GHub class to pull this data down using httpx.Client.get, super simple! The only two endpoints I care about right now are the user and repos endpoints, so I made a get_ function for each. After a little bit of work, the super intricate details of which I won't bore you with, I have a functional cli.py file. For now, the only interaction is a prompt from rich for a username, and then you get a fancy table (also from rich) of the first page of results of repos, stargazer/watchers/forks counts, and a description.

Prompting for the username and showing my table of repositories.

It was a fun evening of learning what's changed in Python3 since I last touched it, especially as I've spent the majority of my career in Python2.7. Type annotations are super awesome. I'll probably pick it up again once I get some more free time later in the week. It's also nice blog fodder! I already have a million things I want to do next - pagination, caching, some more interaction.

Showing py.test running

I know the tool I'm writing is nothing special, especially since GitHub has its own CLI now, but I'm not looking to reinvent the wheel!

Check out the code so far on my GitLab (heh, ironic it's there).

Dependencies: httpx, pip-tools, black, invoke, pytest, pytest-httpx, rich.

 · · ·  python  cli