Sunday fun - Scraping jobs data

Instead of doing sensible Sunday things like laundry, shopping or cooking, I've spent the morning working on a script for retrieving and processing jobs data. I'm hoping to retrieve this data on a weekly basis in order to track changes in the Irish jobs market over the course of a year. By comparing the number of jobs in different sectors, it will be possible to see which sectors are growing and which are not. There is a vast amount of interesting data available to researchers on the internet, but automating its retrieval and processing requires technical skills unavailable to most. I've been arguing for a while that social scientists should learn these skills in order to take advantage of the potential for research, and I'd like to present a simple example of what's possible.

The source:

The jobs site has a wonderful front page which lists all the categories of jobs and the number of jobs available in each. Displayed in simple HTML, this is a perfect source for retrieval.

The script:

Since most of my scripting knowledge is based on bash (the shell native to Linux), I decided to build the script from bash utilities. First, though, I had to install a bash environment on my Windows machine, as I don't really know how to use the Windows command-line tools. That environment is Cygwin, a really great program which recreates a bash environment within Windows.

Using Cygwin, I then began creating the script. The first step was retrieval: I used wget to fetch the front page and saved the output to a file. Of course, this output is in HTML, which means it needs to be processed before it can be used for analysis. To format the file, I first used grep to select the relevant part of the page, then a string of sed commands to remove HTML tags and insert tabs and newlines to prettify the data. A shout out goes to this deadly little sed command, which removes all HTML tags from a file. Once I had tested these two steps, I wrapped them in a shell script so they could be set up as a recurring task, and then used Windows Task Scheduler to run it on a weekly basis.
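The retrieve-and-process pipeline above can be sketched roughly as follows. The URL and the category markup here are made up for illustration; the real front page's structure (and hence the grep pattern and sed expressions) would differ, so treat this as a shape rather than a working scraper:

```shell
#!/bin/bash
# Sketch of the weekly retrieval step. In the real script this would be:
#   wget -q -O jobs_$(date +%Y-%m-%d).html "https://example.com/jobs"
# (hypothetical URL). For an offline illustration, here is the kind of
# HTML a category listing might contain:
cat > sample.html <<'EOF'
<ul class="categories">
  <li><a href="/it">IT</a> <span>(120)</span></li>
  <li><a href="/sales">Sales</a> <span>(85)</span></li>
</ul>
EOF

# Select the category lines, strip every HTML tag, trim leading
# whitespace, then turn "IT (120)" into tab-separated "IT<TAB>120":
grep '<li>' sample.html \
  | sed -e 's/<[^>]*>//g' \
        -e 's/^[[:space:]]*//' \
        -e 's/ (\(.*\))$/\t\1/'
```

Saved as a script, this is what gets handed to Windows Task Scheduler (or cron on Linux) to run once a week, each time appending a dated snapshot of the sector counts.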

Note that I probably wouldn't recommend using bash as a scraping tool in the way that I have done. Using it requires stringing together several different utilities, each with its own options and syntax. Far better would be to use a single language with a purpose-built module (such as Beautiful Soup in Python). I used bash simply because I'm familiar with it; for future projects, it would probably be well worth investing the extra time in a more elegant solution such as Python.

The caveat: Obviously I'm not arguing that this methodology is the ultimate in social science research. As with any methodology, it has strengths and weaknesses. Here are a few problems that spring to mind:

1) Data is king - Research is very strongly guided by what data is available. If sites do not make data available, it is impossible to retrieve it. Other sites, for example, do not present jobs data in such an easy-to-process manner. Cross-referencing data from multiple sources would require significantly more work.

2) In their hands be it - Since one is retrieving data from a third party with whom one has no direct interaction, it is entirely possible for that third party to change the presentation of their data in such a way as to make retrieval impossible. If the site were to change the format of its front page, the retrieval script would be rendered useless.

The next step:

So now I have a retrieval script running on a weekly basis for as long as I choose. Is anyone interested in doing anything with the data? I'd like to try to visualise it, but it would also be great if someone wanted to use it as the basis for a paper.