A Python Disaster

I had 30 docx files, each between 300 and 400 pages long (so between 9,000 and 12,000 pages total), with random headings thrown in and also bolded text that is supposed to be a heading but is, of course, like most things in life, completely without style.

My boss asked me to go through and apply heading styles only to the bolded text, and to remove whatever styling had been applied everywhere else.

So, I could’ve done this manually: open each of the 30 files, find the bold text, and apply the style. OR:

[Image: a photo of a python (the animal, not the programming language)]

Python time! (The age of this python is a metaphor for my current abilities with the programming language.) Photo by Tigerpython (own work), CC BY-SA 3.0, via Wikimedia Commons.

It took longer to write and run the script than it would have taken to do it all manually (mostly because I made little errors).

And here’s the script:

import os
import re

import pypandoc

curDir = os.getcwd()

# Step 1: convert every .docx in the current directory to markdown
for file in os.listdir(curDir):
    if file.endswith('.docx'):
        fileName = os.path.splitext(file)[0]
        pypandoc.convert_file(file, 'md', outputfile=fileName + "-FA.md")
        # left myself console notes so I knew where I was;
        # like I said, this script took a really long time to run
        print(file + " has been converted to markdown")

# Step 2: fix the headings in each converted file, then convert back to docx
for mdFile in os.listdir(curDir):
    if mdFile.endswith('-FA.md'):
        # now we have our markdown files,
        # which are a little easier to parse than docx
        mdFileName = os.path.splitext(mdFile)[0]
        with open(mdFile, encoding='utf-8') as mdFile_opened:
            mdFile_contents = mdFile_opened.read()
        # replace bolded text on its own line
        # with h1's on their own lines
        regex = r"\n\*\*(.*?)\*\*\n"
        subst = r"\n# \1\n"
        mdFile_contents = re.sub(regex, subst, mdFile_contents)
        # opening with 'w' empties the file, so the new contents
        # have to be written back before converting
        with open(mdFile, 'w', encoding='utf-8') as mdFile_opened:
            mdFile_opened.write(mdFile_contents)
        print("Headings inside of " + mdFile + " adjusted!")
        pypandoc.convert_file(mdFile, 'docx', outputfile=mdFileName + ".docx")
        print(mdFile + " converted back to DOCX")

Not sure where it goes wrong, but it does go wrong. It ran for 20 minutes (thankfully giving me updates on its progress, thanks to the print statements I put in there). After the script stalled out on one of the files, I stopped and went with a manual approach. Finding the bold text, applying H1, and then promoting and demoting headings to nest them properly was actually faster than I expected. Only took like 10 hours. No big deal. Most of that time was spent wondering whether something was a sub-section of something else.
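For what it's worth, the bold-to-heading substitution can be tried on its own, with no Pandoc in the loop. What follows is a minimal sketch, not a diagnosis of the stall: `promote_bold_lines` and `BOLD_LINE` are names made up for illustration, and it assumes Pandoc writes a bold-only paragraph as `**text**` on a line by itself. It also swaps in a slightly different pattern from the script's. Because `\n\*\*(.*?)\*\*\n` consumes the newline on each side, it skips the second of two bold lines that sit directly after one another, and it misses a bold line at the very start of a file; anchoring with `re.MULTILINE` avoids both.

```python
import re

# A line that contains nothing but **bold text** (trailing spaces allowed).
# re.MULTILINE makes ^ and $ match at every line boundary, so no newline
# is consumed and adjacent bold lines are each matched.
# Assumption: the heading text itself contains no asterisks.
BOLD_LINE = re.compile(r"^\*\*([^*\n]+)\*\*[ \t]*$", re.MULTILINE)


def promote_bold_lines(markdown_text):
    """Turn standalone bold lines into level-1 headings; leave inline bold alone."""
    return BOLD_LINE.sub(r"# \1", markdown_text)


sample = "Intro paragraph.\n\n**Chapter One**\n\nSome **inline bold** stays put.\n"
print(promote_bold_lines(sample))
```

Running the function over a small hand-written sample like this is a cheap way to check the pattern before pointing it at 30 very long files.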

Moral of the story: Make authors and editors determine heading hierarchy.


17 April 2017 | Python | Pandoc | Regular Expressions | Franco A. Alvarado
