Working With Fixed Width Text Files

Yesterday I attended GovLoop’s “Open Data Event” and was inspired to revisit a project I’d set aside a few months ago . . . to try a different approach.

For those who don’t know, the Fairfax County Police Department posts a weekly “data dump” of the tickets issued and arrests made.  I know Fairfax Underground has created a search engine based around a collection of this data, but it’s little more than a novelty item to see if anyone you know (e.g. friend, neighbor, blind date, etc.) has been arrested or ticketed.  I haven’t seen anyone plot the original data set on a Google Map, let alone mashup with other data sets such as real estate prices or school test scores. It’s become a curiosity for me, to see if something more significant can be done with that information.

One frustrating challenge with this particular data file is that the data is stored in a fixed width text file– a very generic format, but not as immediately useful as XML or JSON, for instance.  You have to write code that tells your application to use characters 1-40 for column 1 data, use characters 41-60 for column 2 data, and so on . . . plus you have to hard code the names and quantity of the data fields, and that just didn’t seem like a particularly robust situation for me.

I wanted to create a php function that could, for lack of a better term, “read and analyze” any fixed width text file and return the names of the fields as well as the start positions for each field dynamically. I’ve written before about the struggle to come up with reusable algorithms that we can apply to different variations of the same basic principles, but this one took longer than I thought.

I won’t rehash the different approaches and failures, and jump to the approach that finally worked for me.

 /* let's open our fixed width text file */
 $handle = @fopen("arrests.txt", "r");

 if ($handle) {
 // read our first line
 $headerRow = fgets($handle, 4096);
 
 /* parse for header row names, using TWO
  or more spaces for RegEx pattern */
 $headerRowNames = preg_split('/(?:\s\s+|\n|\t)/', $headerRow, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE);

// display array contents for examination
 echo "<pre>";
 var_dump($headerRowNames);
 echo "</pre>";

Basically, we grab the first line of the file and use a regular expression of two or more consecutive spaces to figure out where one header row name ends and the next one begins. Since one of the header rows in my sample file has a space in it (i.e. Charge Description), we can’t use a single space as the delimiter. NOTE: There is an edge case where a header row name could be so long that only a single space separates it from the next header row name; so it’s not 100% “bulletproof”, but this was as close I could get to a solution with only one evening of work.

Here are the results when we display the array:

array(8) {
  [0]=>
  array(2) {
    [0]=>
    string(5) "LName"
    [1]=>
    int(0)
  }
  [1]=>
  array(2) {
    [0]=>
    string(5) "FName"
    [1]=>
    int(40)
  }
  [2]=>
  array(2) {
    [0]=>
    string(5) "MName"
    [1]=>
    int(60)
  }
  [3]=>
  array(2) {
    [0]=>
    string(3) "Age"
    [1]=>
    int(100)
  }
  [4]=>
  array(2) {
    [0]=>
    string(7) "DateArr"
    [1]=>
    int(105)
  }
  [5]=>
  array(2) {
    [0]=>
    string(6) "Charge"
    [1]=>
    int(135)
  }
  [6]=>
  array(2) {
    [0]=>
    string(15) "Charge Descript"
    [1]=>
    int(160)
  }
  [7]=>
  array(2) {
    [0]=>
    string(7) "Address"
    [1]=>
    int(210)
  }
}

This gives us a multidimensional array with both the field names contained in the header row and their offset positions, which will certainly come in handy as we process the remaining lines of the file. Time permitting, I’d like to drop this as a method into a php class, alongside methods to convert the extracted data into either a JSON or XML file. That way, I’ll have a utility class that can add value to other fixed width text files in the future.

LyndaCertificates

Lynda.com Courses Completed

MacOSXMavericksNewFeatures_CertificateOfCompletion

LinkedInEssentialTraining_CertificateOfCompletion

CMSFundamentals_CertificateOfCompletion

UpandRunningwithSquarespace_CertificateOfCompletion

ScreencastingwiththeMac_CertificateOfCompletion

SQLEssentialTraining(2009)_CertificateOfCompletion

ColdFusion9NewFeatures_CertificateOfCompletion

YouTubeEssentialTraining(2011)_CertificateOfCompletion

Google+FirstLook_CertificateOfCompletion

jQueryEssentialTraining_CertificateOfCompletion HTML5-ManagingBrowserHistory_CertificateOfCompletion

AchievingYourGoals_CertificateOfCompletion

SettingUpPHP5forWindowswithApacheorIIS_CertificateOfCompletion

HTML5-DocumentEditinginDepth_CertificateOfCompletion

SettingUpPHP5forWindowswithApacheorIIS_CertificateOfCompletion

WordPressMobileSolutions_CertificateOfCompletion

UpandRunningwithBootstrap2_CertificateOfCompletion

¿Hablas history.pushState?

I’m not a fan of “Orange Is The New Black”, but I like this particular scene because it makes an important point: not everyone who looks Hispanic speaks Spanish, and vice versa.

Sometimes I think we try to be too “clever”, making inferences about another person’s capabilities based upon their appearance, their surnames, etc. Pro tip for you– the most reliable way to find out if someone speaks Spanish is to ask them: “¿hablas español?” If they look blankly at you and don’t respond, there’s no point in saying anything further in Spanish.

Likewise, we can be “too clever” when using JavaScript to access features in any given web browser.  I’ve seen web developers use clever tricks to determine which version of Internet Explorer, for example, they are dealing with, and then come up with a workaround for the cool thing they are trying to do. And it even works for a while– until the next version of Internet Explorer comes out and breaks the clever “which version of IE am I dealing with here?” trick they have been relying upon. And that’s when you start seeing code that works in Internet Explorer 8, but breaks in Internet Explorer 9.

Wouldn’t it be nice if we could just ask the browser, “Do you support the history.pushState feature?”  Actually, we can do just that, and it looks like this:

if (history.pushState) {
    history.pushState('', document.title, window.location.pathname);
}
else {
    window.location.hash = '';
}

because "all web professionals must have a website if they want to be taken seriously."