Today we published data on approximately 1.8 million items loaned from the University of Lincoln’s libraries since 2001. The data is available to re-use under a CC0 licence, and can be downloaded from:
We’ve done this as part of our involvement in the Copac Activity Data Project, a.k.a. SALT2. Along with data from the universities of Manchester, Sussex, Cambridge and Huddersfield, our circulation data will be used to power a ‘recommender API’, which libraries will be able to use to build “People who borrowed X also borrowed Y”-type services. The API will benefit from the power of aggregated data from multiple institutions of different types, containing tens of millions of circulation events.
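To give a feel for what a “People who borrowed X also borrowed Y” service does with circulation data, here is a minimal Python sketch that counts how often two works were borrowed by the same person. It is only an illustration of the general idea, not the SALT recommender's actual algorithm; the function names and the tiny sample of loans are made up for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_co_borrow_counts(loans):
    """Count, for each pair of works, how many borrowers took out both.

    `loans` is an iterable of (borrower_id, work_id) pairs.
    """
    works_by_borrower = defaultdict(set)
    for borrower_id, work_id in loans:
        works_by_borrower[borrower_id].add(work_id)

    co_counts = defaultdict(Counter)
    for works in works_by_borrower.values():
        # Each unordered pair of works shared by one borrower counts once.
        for a, b in combinations(sorted(works), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1
    return co_counts


def also_borrowed(co_counts, work_id, limit=5):
    """Return up to `limit` works most often borrowed alongside `work_id`."""
    return [w for w, _ in co_counts.get(work_id, Counter()).most_common(limit)]


# Made-up sample: three borrowers, three works.
sample_loans = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"), ("u2", "C"),
    ("u3", "A"), ("u3", "B"),
]
co_counts = build_co_borrow_counts(sample_loans)
print(also_borrowed(co_counts, "A"))
```

With data aggregated from several institutions, the same counting works on a much larger pool of borrowers, which is where the benefit of pooling comes from.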
You’ll notice as well that we’ve chosen to host the data on our brand-new Orbital (v0.1) research data management application. Each dataset has a persistent citable URI. We’ll be keeping the data up-to-date, and generating a new activity data file from our library circulation logs shortly after the end of each academic year.
The data consists of a number of CSV files (one for each academic year since 2000-01, plus a huge file of all the data), containing the following fields:
| Field index | Field name | Description |
|---|---|---|
| 0 | CREATE_DATE | The date and time of the loan event, in the format: dd/mm/yyyy hh:mm |
| 1 | BORROWER_ID | A cryptographic hash of the internal system ID associated with the borrower of the item, as used in the University of Lincoln’s library system. |
| 2 | WORK_ID | A cryptographic hash of the internal system ID associated with the bibliographic work borrowed, as used in the University of Lincoln’s library system. |
| 3 | CONTROL_NUMBER | The ISBN of the work borrowed (10 or 13 digits). |
| 4 | AUTHOR_DISPLAY | The main author of the work borrowed. |
| 5 | TITLE_DISPLAY | The title of the work. |
| 6 | PUB_DATE | The publication year of the work in the form: yyyy |
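If you want to read the files programmatically, something along these lines should work. It is a rough Python sketch that reads the seven fields by position and parses CREATE_DATE using the dd/mm/yyyy hh:mm format described above. Whether the files carry a header row is an assumption I have not checked here, so the sketch simply skips any row whose first field is not a valid date. The `Loan` type, the `parse_loans` function and the sample row are invented for illustration.

```python
import csv
import io
from datetime import datetime
from typing import NamedTuple


class Loan(NamedTuple):
    created: datetime   # field 0, CREATE_DATE
    borrower_id: str    # field 1, BORROWER_ID (hashed)
    work_id: str        # field 2, WORK_ID (hashed)
    isbn: str           # field 3, CONTROL_NUMBER
    author: str         # field 4, AUTHOR_DISPLAY
    title: str          # field 5, TITLE_DISPLAY
    pub_year: str       # field 6, PUB_DATE


def parse_loans(lines):
    """Yield a Loan for every well-formed row in an iterable of CSV lines."""
    for row in csv.reader(lines):
        if len(row) != 7:
            continue  # not one of the seven-field rows described above
        try:
            created = datetime.strptime(row[0], "%d/%m/%Y %H:%M")
        except ValueError:
            continue  # header row (if any) or an unparseable date
        yield Loan(created, *row[1:])


# Made-up sample content standing in for a downloaded file.
SAMPLE = io.StringIO(
    "CREATE_DATE,BORROWER_ID,WORK_ID,CONTROL_NUMBER,"
    "AUTHOR_DISPLAY,TITLE_DISPLAY,PUB_DATE\n"
    '05/10/2010 14:30,aaaa1111,bbbb2222,0000000000,'
    '"Example, Author",An example title,2009\n'
)
loans = list(parse_loans(SAMPLE))
print(loans[0].created.isoformat(), loans[0].title)
```

For a real file you would pass `open(path, newline="", encoding="utf-8")` instead of the in-memory sample; the encoding is also an assumption.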
I’ll blog in detail another time about exactly how we created the data extracts. In short:
- There is a table in the SirsiDynix Horizon library management system called circ_tran which records every instance of item number X borrowed by user number Y at time Z. [#1]
- There is another table which provides a lookup between item numbers and the numbers of the bibliographic works of which they are a copy. [#2]
- Dave Pattern at the University of Huddersfield wrote a Perl script which scrapes all the bibliographic data (title, author, ISBN) for each work from our OPAC (Horizon Information Portal) and writes it to a text file. [#3]
- Developer Jamie Mahoney of CERD/LNCD then stepped in, using some pretty heavy SQL on the original three data extracts, to:
  - Hash the internal Horizon user and work ID numbers to provide anonymity;
  - Convert the internal Horizon date and time stamps in extract [#1] from a version of Unix time into a readable datestamp (formula hint: cko_date*86400 + cko_time*60);
  - Use the item/work lookup table [#2] to pull in the bibliographic details for each loan in [#1] from the bibliographic table [#3] (an epic SQL JOIN query), removing items which are no longer represented in our library system;
  - Remove any items without an ISBN, which are of no use to the SALT recommender API;
  - Tweak the punctuation and formatting;
  - Split the data into separate files for each year.
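The real work was done in SQL, but the core of the steps above can be sketched in a few lines of Python. Treat this as an illustration under assumptions rather than what Jamie actually ran: the formula hint suggests cko_date counts days and cko_time counts minutes, and I am assuming the epoch is 1 January 1970 with no timezone adjustment. The choice of salted SHA-256 for the hash, the function names, the dictionary shapes and the sample values are all hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def hash_id(internal_id, salt="example-salt"):
    """One-way hash of an internal ID (salted SHA-256 is an assumption)."""
    return hashlib.sha256(f"{salt}:{internal_id}".encode("utf-8")).hexdigest()


def horizon_to_datestamp(cko_date, cko_time):
    """Apply cko_date*86400 + cko_time*60 and format as dd/mm/yyyy hh:mm."""
    seconds = cko_date * 86400 + cko_time * 60
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.strftime("%d/%m/%Y %H:%M")


def build_rows(circ_tran, item_to_work, bib):
    """Join loans [#1] to works [#2] and bibliographic details [#3]."""
    rows = []
    for item_no, user_no, cko_date, cko_time in circ_tran:
        work_no = item_to_work.get(item_no)  # None if the item has gone
        record = bib.get(work_no)
        if record is None or not record.get("isbn"):
            continue  # drop withdrawn items and works without an ISBN
        rows.append([
            horizon_to_datestamp(cko_date, cko_time),
            hash_id(user_no),
            hash_id(work_no),
            record["isbn"],
            record["author"],
            record["title"],
            record["pub_date"],
        ])
    return rows


# Made-up sample extracts: (item number, user number, cko_date, cko_time).
circ_tran = [(1001, 42, 1, 90), (1002, 43, 1, 95), (1003, 42, 2, 0)]
item_to_work = {1001: 7, 1002: 8}  # item 1003 is no longer in the system
bib = {
    7: {"isbn": "0000000000", "author": "Example, Author",
        "title": "An example title", "pub_date": "2009"},
    8: {"isbn": "", "author": "Another, Author",
        "title": "No ISBN here", "pub_date": "1999"},
}
rows = build_rows(circ_tran, item_to_work, bib)
print(rows[0][0])  # 02/01/1970 01:30
```

Only the first sample loan survives: the second has no ISBN and the third refers to an item that is missing from the lookup table.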
Once again, the data is at:
Thanks are due to Chris Leach and Dave Pattern for Horizon-fu, and to Jamie Mahoney for his patient wrangling of several millions of lines of data!
You can find out more about the Copac Activity Data Project/SALT2 at: http://copac.ac.uk/innovations/activity-data/