It took a while to surface, but over the weekend I finally spotted the obvious possibilities raised by Google Spreadsheet’s new xml-linking features. Abi and I currently enter isbn, title and author data into our bookselling spreadsheet by hand, and this gets old quickly. More so for Abi, who has to check the individual Amazon pages to gauge current prices. I knew of Amazon’s API, and was planning to create something to help speed this up, but the only programming language I was ever any good at was Visual Basic, which isn’t around much any more, and I’m seriously rusty at the little .NET I once knew. So I wasn’t quite sure of the best way to approach things, and GS’ new hotness seemed like an ideal solution, if I could get it to work. Happily it did, and I can now enter an ISBN and have GS download the author, title, amazon URL, amazon price and lowest used price data automatically. Here’s a basic way to set it up:
- If you haven’t got one, sign up for an Amazon Web Services account and find your Access Key. This’ll let you make 1 request per second, although in practice this is fairly flexible.
- Take a look at the Amazon ECS API, and determine the appropriate request. It’s a web service, so supplies data after a standard browser request. Here’s the query I’m using, with an example book:
http://webservices.amazon.co.ukonca/xml?Service=AWSECommerceService &SubscriptionId=[your_ID] &ResponseGroup=Medium &SearchIndex=Books &Operation=ItemLookup &IdType=ISBN &ItemId=0593055489
The most important part of this is the ResponseGroup, which determines how much data Amazon supplies. Medium is a good all-rounder, but there are plenty of more specific request types. Specifying ISBN as the IdType allows the use of 13-digit ISBNs, which a standard ItemLookup doesn’t – Amazon’s documentation claims it doesn’t work on the UK site, but it does.
- Test out the request in your browser window, and make sure it’s supplying all the necessary data.
- Now it’s time to figure out the XPath statements you’ll need to extract the relevant info. If, like me, you haven’t the foggiest idea what XPath means, don’t worry. It’s pretty easy, and the W3Schools tutorial takes 10minutes to explain everything you’ll need.
- In this case, I want to extract the Amazon price, the lowest used price, the URL of the detail page, the (first) author and the title:
- The price I want is in the FormattedPrice element, underneath the ListPrice node. There’s more than one ‘FormattedPrice’ element in the results, so I need to specify its parent node. The appropriate XPath statement is “//ListPrice/FormattedPrice”
- The lowest used price is the same element, but under LowestUsedPrice, so: “//LowestUsedPrice/FormattedPrice”
- The URL is in the “DetailPageURL” element, and there’s just the one of these, so: “//DetailPageURL”
- Similarly with title: “//Title”
- There may be multiple authors, and grabbing all of them would mess up the spreadsheet formatting we’ll come to in a minute. So I’ll just grab the first one: “//Author”. The number should, I think, be a zero to adhere to the standards. But, it isn’t – zero doesn’t actually work. Go figure.
These can all be combined into one XPath statement with | operators, so the final result is:
“//ListPrice/FormattedPrice | //LowestUsedPrice/FormattedPrice | //DetailPageURL | //Author | //Title”
- Ok, fire up a spreadsheet. Create a column for ISBN, price, lowest price, url, author, title and finally another for processing.
- GS uses the ‘importXML’ function to import data. This takes two arguments – the URL and the XPath statement. Enter an example ISBN, then, in the ‘processing’ column:
=importXML(“http://webservices.amazon.co.uk/onca/xml?Service=AWSECommerceService&SubscriptionId=[your_ID]&ResponseGroup=Medium&SearchIndex=Books&Operation=ItemLookup&IdType=ISBN&ItemId=”&B2,”//ListPrice/FormattedPrice | //LowestUsedPrice/FormattedPrice | //DetailPageURL | //Author | //Title”))
The &B2 at the end of the URL appends the ISBN number, so be sure to change this to the appropriate cell.
- Google should now go fetch the XML file, process it according to your XPath statement and…dump all the data into the ‘processing’ column. If you click on the cells containing the data you’ll see they have ‘Continue’ formulae. This (currently undocumented) formula simply takes data from the array stored in the first cell. So “=CONTINUE(H2, 5, 1)” shows the data from the 1st child (if it exists) of the 5th item of the results of the XPath processing – the book Title. Copy and past these statements to the appropriate cells on the correct row.
- We need to stop Google filling the subsequent rows, so enter a random text string in the second row of the processing column, then enter a different ISBN into the first row to refresh the query. Google will again fetch the data, then ask if you want to overwrite existing data. Say no, and it’ll only populate the Continue formulae we just set up on the correct row.
- CTRL-D the formulae down a few rows, and voila, you’ve got an easy-to-use ISBN query spreadsheet that uses one query per row. If you want to tidy it up, wrap it in an if statement like:
isblank() doesn’t seem to work properly with cells that have had content deleted.
Problems with this method:
- Queries are currently limited to 50. And that’s 50 queries present, rather than active – hiding them behind an if statement based on the contents of another cell doesn’t work. This is easy to get around via copy-and-pasting, but could be annoying in some circumstances.
- Queries update every two hours. This is unnecessary for this purpose, and I’d personally like to see unlimited one-time only queries that don’t update every 2 hours; hopefully that’ll come.
- Amazon theoretically have a 1 query-per-second limit. I’ve actually dumped 25 ISBNs at once and had results appear instantly, so I think they must have a flexible policy, but it’s a little risky and I don’t want my account suspended. This is another reason I’d like a one-time-only query option. It’s probably wise to play it safe by cutting-and-pasting the data into another table before closing the sheet.
- If Amazon doesn’t have a particular datum, the XPath query will be in the wrong order. For example, some annuals and readers digest books have no author listed, so the ‘Continue’ statement that should refer to the 4th field (in this case, the author) instead refers to the title. I can’t think of a way around this without multiple queries.
Hopefully I’ll be able to refine this in the future, but it’s not a bad start, and I’m very impressed that the functionality exists at all.