17 April 2017
Python | Pandoc | Regular Expressions
Franco A. Alvarado
I had 30 docx files, each between 300 and 400 pages long (so between 9,000 and 12,000 pages total), with random headings thrown in and also bolded text that is supposed to be a heading but is, of course, like most things in life, completely without style.
My boss asked me to go through and apply heading styles only to the bolded text, and to remove any other heading styles that had been applied.
So, I could’ve done this manually: opened each of the 30 files, done a find-and-replace on the bold text, and applied the style. OR:
Python time!
(Image caption: The age of this python is a metaphor for my current abilities with the programming language. By Tigerpython (Own work), CC BY-SA 3.0, via Wikimedia Commons.)
It took longer to write and run the script than it would have taken to do it all manually (mostly because I made little errors).
And here’s the script:
Not sure where it goes wrong, but it does go wrong. It ran for 20 minutes (thankfully giving me updates on its progress, thanks to the helpful print-to-console commands I put in there). After the script stalled out on one of the files, I stopped it and went with a manual approach. It was actually faster than I expected: find/replace on bold text to apply H1, then promote and demote headings to nest them appropriately. Only took like 10 hours. No big deal. Most of that time was spent wondering whether something was a sub-section of something else.
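For anyone tempted by the scripted route, here is a minimal sketch of the bold-to-heading pass as a regular-expression job. It is not the script from this post, just one way the task could be approached. It assumes each file has already been converted to Markdown with something like `pandoc input.docx -t markdown --wrap=none -o input.md`, that headings come out in ATX style (`# Title`), and that a "bolded heading" is a line consisting of a single `**bold**` run. The function name `restyle_headings` and both patterns are made up for illustration.

```python
import re

# Assumption: headings are ATX-style lines such as "## Title".
HEADING = re.compile(r"^#{1,6}\s+(.*?)\s*$")
# Assumption: a "bolded heading" is a line that is one bold run, e.g. "**Title**".
BOLD_ONLY = re.compile(r"^\*\*([^*]+)\*\*\s*$")


def restyle_headings(markdown: str) -> str:
    """Demote existing headings to plain text and promote bold-only lines to H1.

    Hypothetical helper: works line by line and does not skip fenced code blocks.
    """
    out = []
    for line in markdown.splitlines():
        heading = HEADING.match(line)
        bold = BOLD_ONLY.match(line)
        if heading:
            title = heading.group(1)
            inner = BOLD_ONLY.match(title)
            if inner:
                # A heading whose whole text is bold stays a heading (as H1).
                out.append("# " + inner.group(1))
            else:
                # Any other heading loses its heading style.
                out.append(title)
        elif bold:
            # A bold-only line becomes an H1.
            out.append("# " + bold.group(1))
        else:
            out.append(line)
    return "\n".join(out)


sample = "# Old Heading\n\n**Real Heading**\n\nSome **bold** words.\n\n## **Bold Heading**"
print(restyle_headings(sample))
```

The result could then be turned back into a Word file with `pandoc output.md -o output.docx`. Everything lands at H1, so deciding what nests under what is still a job for a human.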
Moral of the story: Make authors and editors determine heading hierarchy.