PDF to Text Conversion

This project combines a couple of well trodden paths: PDF to text conversion, and then running an app in the background with audio playback. It introduced some new concepts to me and, based on a trawl of the usual resources for problem-solving, at least a couple of issues that are worth recording.

The TL;DR version is that PDF parsing gets into pretty complicated territory – unless you happen to know C well. There are open source libraries out there, but they didn’t hit the mark for me. I’ve implemented my own parser which is crude, but works. More or less!

Any PDF manipulation in iOS is going to depend on the Quartz 2D library somewhere along the line. Whether you call it directly or rely on another API that wraps it is a matter of choice. I looked at a couple. PDFKitten has a lot of functionality and seems to be by far the most sophisticated open source library but the documentation didn’t cover the simple requirement that I had – text extraction. There’s another one called pdfiphone that I struggled to get to work, and which epitomises the main challenge that I had with this project: I have only a rudimentary knowledge of C, which is what you’re getting into with Quartz.

So the basic structure of a PDF is a series of tags associated with different types of payload. You break the document down into pages, and process each page as a stream, calling C functions associated with the tags. This is a simple adaptation of example code straight from the Quartz documentation:

for (NSInteger thisPageNum = 0; thisPageNum < numOfPages; thisPageNum++)
   CGPDFPageRef currentPage = CGPDFDocumentGetPage(*inputPDF, thisPageNum +1);
   CGPDFContentStreamRef myContentStream = CGPDFContentStreamCreateWithPage (currentPage);
   CGPDFScannerRef myScanner = CGPDFScannerCreate (myContentStream, myTable, NULL);
   CGPDFScannerScan (myScanner);
   CGPDFPageRelease (currentPage);
   CGPDFScannerRelease (myScanner);
   CGPDFContentStreamRelease (myContentStream);
   CGPDFOperatorTableSetCallback(myTable, "TJ", getString);

In the last line I call my own C function ‘getString’ when the stream encounters the tag “TJ”. Here’s the first part that was new to me: the blending of C and Objective C. My function call, which is an adaptation of code I found here, looks like this:

void getString(CGPDFScannerRef inScanner, void *userInfo)
   CGPDFArrayRef array;
   bool success = CGPDFScannerPopArray(inScanner, &array);
   for(size_t n = 0; n < CGPDFArrayGetCount(array); n += 1)
      if(n >= CGPDFArrayGetCount(array))
      CGPDFStringRef string;
      success = CGPDFArrayGetString(array, n, &string);
         NSString *data = (__bridge NSString *)CGPDFStringCopyTextString(string);
         [globalSelf appendMe:data];

So there a couple of things going on here: this code is simply looking for a string – well a Quartz style CGPDFStringRef – in the payload passed in by stream process. If it finds one, it converts it into an NSString via some ‘bridge casting’ – something I’ve come across before in working with the keychain, and which you need for ARC compliance. I then take that string and append it to a property in a local method called appendMe.

It’s not possible to call ‘self’ from a C function. There are a number of possible ways around this, some of which get pretty nasty. The most elegant that I found was this:

static P2VConverter *globalSelf;

   globalSelf = self;

…which assigns an instance of the class that I created to do the PDF processing to a static variable called *globalSelf, and which I can then refer to as an alternate to self. To say this implementation isn’t particularly memory efficient is an understatement – but it works.

There is a rich set of tags defined by a published PDF standard – all 800 pages of it – that tell whatever is rendering the document what to do with it. The best general explanation I found was this. There is a relatively small set of text related tags and TJ seems to be the simplest. It’s also the only one that I was able to adapt from other examples. I may come back to this again.

The way I tested this was to convert an HTML page into a PDF using Safari. The more complicated the text structure in your input document – say multiple columns per page, text boxes etc – the worse this simple extraction mechanism is going to cope.

On to email based importing of files. This isn’t something that I’d ever looked at before and it turned out to be a little more complicated than I expected. The amendments to the info.plist are pretty trivial, creating the association between the file type and the app. So in the PDF reader, when you launch the app in the contextual menu, what actually happens is that the file is copied to the app’s Documents folder, and a file:// style URL which points at it is passed to the AppDelegate – specifically, the application:(UIApplication *)application handleOpenURL method. I’d assumed in the first instance, that I’d import the header for the viewcontroller into the AppDelegate – and this is a single view app, so ViewController.h – instantiate the VC, call a method I expose in the header and I’d be done:

ViewController *thisVC = [[ViewController alloc] init];
[thisVC importFromEmail:url];

This is wrong, and led to some peculiar side effects, which emerged when I started to try to set the point in the text to resume speech to reflect a change in the scrubbing control. This is what I came up with, which is a variant of this, adapted for a single view app:

UIStoryboard *storyboard = [UIStoryboard storyboardWithName:@"Main" bundle:nil];
UINavigationController *root = [[UINavigationController alloc]initWithRootViewController:[storyboard instantiateViewControllerWithIdentifier:@"EntryVC"]];
self.window.rootViewController= root;
ViewController *thisVC = (ViewController*)[[root viewControllers] objectAtIndex:0];
if (url != nil && [url isFileURL]) 
   [thisVC importFromEmail:url];
   NSLog(@"url: %@", url);

A couple of points of note. First, the method being called here is going to run before  viewDidLoad or viewWillAppear. I normally do various inits in viewWillAppear, so I put them in a method that I call immediately in importFromEmail. Second, the string value for instantiateViewControllerWithIdentifier needs to be set in the storyboard.

Apart from the nasty callouts to C, what I spent most of my time working on scrubbing control functionality. I’ve created an IBAction method that will be called when the scrubber moves position. In the storyboard, I set the maximum value of the control to be 1, so to get the index of the new word position after dragging the control, I multiply the fraction that moving the control allows me to reference in the IBAction by the length of the original pdf string length.

Having stopped – not paused; more on that in a second – the playback, I then start a new playback in the IBAction method for the play button, having created a new string based on a range: the word index from the scrubber control as the start point, and then the length by subtracting that from the original pdf string length. There was a little bit of twiddling necessary to support this so that it would work when called multiple times.

Part of the reason I took this approach was because the continueSpeaking method on the AVSpeechSynthesizer class didn’t seem to work. This was because I was using stopSpeakingAtBoundary instead of pauseSpeakingAtBoundary – something I’ve just noticed. Doh!

This has a knock-on effect, which is that the play button has to be stateful, with a continue if the pause was pressed, or a restart with the new substring if it’s because of the setting of the scrubber control. Given that the actual quality of the string conversion is pretty basic, the fix for this exceeds the usefulness of the app.

A couple of final comments. I discovered that it’s best to keep the functionality in the IBAction method called for the scrubbing control changes pretty simple: basically just setting the property for the new word position. I got some peculiar results when trying to do some string manipulation for a property, because the method was being called before a prior execution to set it was completed.

Lastly, I encountered an odd, and as yet unresolved bug when, as an afterthought, I added support for speech based on text input directly into a UITextField. I started seeing log errors of the type _BSMachError: (os/kern) invalid name (15). This appears to be quite a common one, as the number of upvotes on this question attest to. I’m filing it in the same bucket as the stateful play button resolution: if the quality of the playback warranted it, I’d figure it out.

I was in two minds as to whether or not to write this app up given the mixed results, but I thought I might as well as it may be one of my last pieces of Objective C development.

I blogged a year ago, almost to the day, about my first Apple Watch app. I had one pretty serious watch project last summer, which I ended up canning around November. The idea was to sync accelerometer data from the watch with slow motion video, say of a golf swing. However, I ran into a problem with the precision of the data in AVAsset metadata, and which I have an as-yet unanswered question on in StackOverflow.

I also ran into the same usage issues with the watch that have been widely reported in the tech press. While I really liked the notifications [and the watch itself, which is a lovely piece of kit], the absence of any other compelling functionality barely warranted the bother of charging the thing. The borderline utility really came to the forefront for me when I was travelling, for both work and holidays, with no roaming network access. The watch stayed at home.

I also think it’s pretty telling that I bought precisely zero apps for it before I sold it a couple of months ago.

I’ll be looking to replace my current iPhone 6 Plus later this summer and I’m toying with the idea of moving to Android. I tried it before and hated it, but that was with a pretty low-end phone that I got as a stopgap after my 4S had an unfortunate encounter with the washing machine. Java would be much more useful than Objective C in my working life, and it’s a potentially interesting route into the language.