2010-08-30 Monday
- Using the word vendor instead of bookseller
- Thinking about identifiers
2010-08-27 Friday
- Need to check abbyy: americanpsycholo00amer and americanpublicop00erik - answer: print disabled
- We could use the word 'supplier' instead of 'bookseller'
2010-08-12 Thursday
- Thinking about speeding up solr update using a cache of data from the Open Library database.
2010-07-29 Thursday
- Search walmart for books: http://www.walmart.com/search/search-ng.do?search_constraint=3920&search_query=ISBN
- http://www.booksamillion.com/product/0143038257
- http://search.barnesandnoble.com/mobile/e/9780143038252
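The Books-A-Million URL above takes a 10-digit ISBN and the Barnes and Noble URL takes the 13-digit form of the same book. A minimal sketch of the standard ISBN-10 to ISBN-13 conversion follows; the function name is made up for illustration and nothing here is taken from the lookup code.

```python
def isbn10_to_isbn13(isbn10):
    """Convert a 10-character ISBN to its 978-prefixed 13-digit form."""
    digits = isbn10.replace("-", "")
    if len(digits) != 10:
        raise ValueError("expected a 10-character ISBN: %r" % isbn10)
    # Drop the ISBN-10 check character and add the 978 prefix.
    body = "978" + digits[:9]
    # ISBN-13 check digit: weights alternate 1, 3 across the first 12 digits.
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(body))
    return body + str((10 - total % 10) % 10)


print(isbn10_to_isbn13("0143038257"))  # prints 9780143038252
```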
2010-06-16 Wednesday
- Overdrive: 75810 books, 46139 with an original ISBN
- Amazon lookup code is in: ia331504:/2/edward/20century/avail_check.py
2010-06-08 Tuesday
- Fix search page titles
- Rebuild solr index
- Work subjects file for Anand
- Maybe make /search?ftokens=pfbyekbzsflz redirect to /subjects/social_life_and_customs
2010-06-04 Friday
- Invalid MARC XML: datafield tag="g050"
2010-06-01 Tuesday
Todo:
- Fix covers, load missing ones for print disabled books
- Make scribe handle works
- Move bits off ia331504
- Documentation
- Read e-mail
- Write biography
- Read CVs for new developer.
2010-05-11 Tuesday
- Sent fresh job description to George
- Trying to load missing Google books without scandate
- Stopped Unhandled Exception by disabling DataTable cookie
Breakdown of print disabled books not loaded:
2206 items that end with mbp and have no MARC XML
1983 duplicate scans
662 items with MARC that says they aren't books
288 items not included in the search engine because of a bug in infogami
Need to write mail explaining the situation with print-disabled and MARC records that say they're not books.
- Need to update subject index with Protected DAISY and Accessible books.
2010-05-04 Tuesday
- Found 53 editions that are made up of duplicates within the print disabled collection, sent a mail about it.
- Pool can't support new longer book keys, uses varchar(16), added code to turn them into old style book keys
Import server runs from /1/edward/src/openlibrary/openlibrary/catalog/importer/import_server.py like this: "python import_server.py 9020"
Running a bunch of code to load print disabled books:
- adding print disabled identifiers to existing records: 6.79%: 3615 of 53591 (saves using save_many every 50 books)
- adding the 'Accessible book' subject to works with scans: 6.35% complete (using save_many, every 100 works)
- loading new print disabled books: 2.89%: 776/26828 (not using save many, saves each book individually)
- solr_update is watching the log for changes and updating the search engine in chunks of 100 works
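The loaders above save in batches of 50 or 100 rather than one record at a time. A rough sketch of that batching pattern, assuming the save function is passed in as a callable; `chunks` and `save_in_batches` are hypothetical helpers, and the `save_many` argument here is a stand-in, not the real Open Library call.

```python
def chunks(items, size):
    """Yield successive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Final, possibly short, batch.
        yield batch


def save_in_batches(records, save_many, size=50):
    """Call save_many once per batch and return the number of calls made."""
    calls = 0
    for batch in chunks(records, size):
        save_many(batch)  # stand-in for whatever actually writes the batch
        calls += 1
    return calls
```

Passing the save function as an argument keeps the batching logic testable without a database.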
2010-05-03 Monday
- Waiting on Anand's database migration
- Downloading all print disabled MARC, ready to analyse
- Might need to rewrite code to use new prefixes for books and authors
- Give priority to author merging
- Avoid loading books with bad dates in the distant future, use 260c instead
- Don't give spelling suggestions that have no results
- 2215 of the printdisabled books don't have MARC
- 639 print disabled books aren't monographs
- 77160 books in print disabled collection
- 74306 will be loaded
- Brewster asked about SFPL MARC records on openlibrary.org, not part of ol_data collection
- Alexis moved all data in the marcrecords collection to ol_data.
- Found 53 scanned books without files.xml, Hank says it is a permissions problem
- Redownloaded 53 scanned books with missing files.xml, now down to 43 books with missing files.xml
- Hank says: "Nearly all of these are on ia301516:/1/items where a secondary-to-primary disk rescue is in progress."
- Ralf says we need to replace a failing hard drive in ia331507
- I'm rebuilding the author index from an old author dump.
- No work record for 1984 by George Orwell.
- can't run work finder or solr update while JSON API is broken, it can't handle queries that include something like title=null.
- Jeff pointed out we are missing an edition for Cibola by Alice Walworth Graham, I think it is because of the database migration.
- work Solr was missing textSpell fieldtype, always suggesting errors in searches not written in lowercase - fixed
- MARC record says book is an electronic resource when it isn't.
2010-04-30 Friday
- To load sfpl book descriptions I need to have mapping from source_record -> edition -> work
2010-04-29 Thursday
2010-04-28 Wednesday
Fix up subjects, dump commas:
- World War, 1939-1945 --> World War (1939-1945)
- Nigeria -- Civil War, 1967-1970 -> "Nigeria Civil War (1967-1970)" and "Nigeria"
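The two rewrites above could be expressed roughly like this. It is a sketch that only covers headings ending in a ", YYYY-YYYY" date range; `fix_subject` is a hypothetical name, not the function in openlibrary.solr.work_subject.

```python
import re

# A trailing ", YYYY-YYYY" date range on the last part of a subject heading.
DATE_RANGE = re.compile(r"^(.*), (\d{4}-\d{4})$")


def fix_subject(heading):
    """Return comma-free subjects for a heading such as 'World War, 1939-1945'."""
    parts = [p.strip() for p in heading.split(" -- ")]
    m = DATE_RANGE.match(parts[-1])
    if not m:
        return [heading]  # no trailing date range, leave the heading alone
    parts[-1] = m.group(1)
    # Join the parts and move the date range into brackets.
    subjects = ["%s (%s)" % (" ".join(parts), m.group(2))]
    # Leading parts, such as the place name, also become subjects of their own.
    subjects.extend(parts[:-1])
    return subjects
```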
Need to rebuild subject list from latest dump.
Get to use: from openlibrary.solr.work_subject import get_marc_subjects
Building fresh file of work subjects
Building file of changes for Anand
Work finder is running, but sometimes crashes, recording some logs, might help debugging
Solr update is running, it runs work finder for author merges
Downloading MARC and META XML for printdisabled books
2010-04-26 Monday
Mary pointed out missing books:
Made HTML dump of Henrik Ibsen works to assist debugging of work finder.
Challenge with the work finder is splitting up these works:
- OL52404W: From Ibsen's workshop
- OL52412W: The collected works of Henrik Ibsen
2010-04-22 Thursday
Time to make an edition search index. 24012766 editions with authors in /3/edward/db_dump/edition_dump2
Need to add work description to search engine. Maybe add created and last modified.
After work finder runs make work updater a two step process
2010-04-21 Wednesday
Work finder has difficulty splitting existing works.
OL52397W should be split into three works:
- Brev 1845-1905
- Brev veksling med Christiania Theater, 1878-1899
- Henrik Ibsens brevveksling med Christiania Theater 1878-1899
- The correspondence of Henrik Ibsen
Add logging to work finder code for updating works.
Add a pager to subject and author searches.
We don't have ISBN 0001982370 in Open Library yet.
Author merge crash, on "assert cur['type'] == '/type/author_role'" - fixed.
Usergroup and /type/volume edit pages don't work on upstream.
Karen wants title and subtitle as two input boxes in librarian mode.
WorkBot messed up title of work
Need to avoid having commas in subjects, they get broken when saving.
Need to try and include place names in time subjects about wars.
we now have subtitle support on the work edit page
I just changed the title of 'Flatland' to 'Flatland: a romance of many dimensions'. When I hit save it gets split on the ': ' into title and subtitle, here is my edit:
http://upstream.openlibrary.org/works/OL118420W/Flatland?b=5&m=diff
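The split described above amounts to something like the following. This is a sketch only, assuming the split happens on the first ': '; `split_title` is a hypothetical name and not the code behind the work edit page.

```python
def split_title(full_title):
    """Split 'Title: subtitle' on the first ': ' into (title, subtitle)."""
    title, sep, subtitle = full_title.partition(": ")
    # No separator means there is no subtitle.
    return title, (subtitle if sep else None)


print(split_title("Flatland: a romance of many dimensions"))
```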
Can't delete work with editions.
if delete:
    if self.edition:
        self.delete(self.edition.key, comment=comment)
    if self.work and self.work.edition_count == 0:
        self.delete(self.work.key, comment=comment)
    return
multiple work titles: epistolaeadattic03ciceuoft
multiple work titles: epistolaeadattic00ciceuoft
multiple work titles: epistolaeadattic02ciceuoft
multiple work titles: epistolaeadattic04ciceuoft
multiple titles: resoflegkentucky00kent
multiple titles: thermodynamicsh05woodgoog
multiple titles: ueberdiesprache00burmgoog
multiple titles: volume00goog
multiple titles: worksofrightrevb00strauoft
multiple titles: yesterdaytodaya08bickgoog
Mad MARC
- Big TOC
- LC classification: serial has 050, but only 050d
MARC code breakdown:
{'`': 1, 'a': 188926, ' ': 5, 'c': 921, 'b': 153, 'e': 112, 'g': 6, 'i': 1, 'k': 24, 'j': 17, 'm': 1, 'p': 488, 's': 8, 't': 967, 'v': 226, 'y': 179, 'x': 408}
{'a': 598, ' ': 830, 'c': 284, 'b': 13, 'd': 63, 'm': 110279, 's': 80352, 'p': 15, 'S': 9}
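The two dictionaries above look like counts of a single leader character per record. In MARC 21, leader position 6 is the type of record and position 7 is the bibliographic level, so that is the assumption made in this sketch; the notes don't say which positions were counted, and `leader_breakdown` is a hypothetical helper.

```python
from collections import Counter


def leader_breakdown(leaders, position):
    """Count the character found at `position` in each MARC leader string."""
    return Counter(
        leader[position] for leader in leaders if len(leader) > position
    )


# Two made-up 24-character leaders: a monograph ('m') and a serial ('s').
sample = ["00714cam a2200205 a 4500", "01142cas a2200301 a 4500"]
print(dict(leader_breakdown(sample, 6)))  # type of record
print(dict(leader_breakdown(sample, 7)))  # bibliographic level
```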
18920 works_with_bad_subjects in /3/edward/db_dump
http://upstream.openlibrary.org/search?q=%22thesis+for+BA%22
archive.org metadata table is sometimes out of date. Search engine is more current.
There are 8402 scanned books that have 'MARC Source' in the format field, but don't have ';MARC;'.
I'm looking at MARC records of scanned books that haven't been loaded into Open Library. The first one I've come to is:
Afganistan by Angus Hamilton (1874-1913) http://www.archive.org/details/00hamigoog http://upstream.openlibrary.org/show-marc/00hamigoog/00hamigoog_meta.mrc:0:2387
The title field, tag 245 appears twice in the MARC record:
245 $6 01 $a Afganistan $c A. Gamilʹton ; perevod s angliĭskago S.P. Golubinova.
245 $6 01 $a Афганистанъ $c А. Гамильтонъ ; переводъ съ английскаго С.П. Голубинова.
I think the best thing to do with a record like this is use the first title.
Next year we can work out multilingual records and load the second title as well.
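Using the first title could look roughly like this. The sketch assumes a record is already parsed into a list of (tag, value) pairs, which is a made-up representation for illustration and not the format the MARC import code uses.

```python
def first_title(fields):
    """Return the value of the first 245 field, or None if there isn't one."""
    for tag, value in fields:
        if tag == "245":
            return value
    return None


record = [
    ("100", "Hamilton, Angus"),
    ("245", "Afganistan"),
    ("245", "Афганистанъ"),
]
print(first_title(record))
```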
2010-04-20 Tuesday
Spotted that solr_update.py was ignoring save_many. Extracted a list of 46357 works updated/created by the WorkBot, passing to search engine.
Todo
Before release
- change solr_update.py to handle save_many. Not difficult.
- make work finder run for author merge. Maybe from solr_update.py when it sees an author merge?
- update MARC import to search for an existing work, or create a new work, maybe just run the work finder?
- add resume after crash to work finder
- support user created subjects in search engine
- add missing subjects to work pages
- remove bad subjects like /subject/History from work pages
- updates to subject search index
Maybe before release
- load books with MARC that have the scan_date set to null
- load Google books without MARC records
- load missing books from Amazon using sitemap
After release
- currently there are a handful of multi-volume works in Open Library, load the rest
- research serials and load into Open Library. Find number of serials missing from Open Library
- add subjects from amazon to work subject field
- add ability to search for encrypted DAISY files.
More scanned books
~/scans/goog_no_marc_not_ol contains a list of 317,697 Google books without a MARC record, not in OL.
Can't transfer the file by e-mail or scp.
2010-04-19 Monday
873025 editions with scans
1566659 scans possible to load, matching these criteria:
- scanner is not null
- noindex is null
- mediatype = 'texts'
- curatestate = 'approved' or curatestate is null
- scandate is not null
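The criteria above can be written as a single predicate. This sketch assumes each item is a dict keyed by the metadata column names, with None standing in for a database null; `can_load` is a hypothetical name.

```python
def can_load(item):
    """True if a metadata row meets the loading criteria listed above."""
    return (
        item.get("scanner") is not None
        and item.get("noindex") is None
        and item.get("mediatype") == "texts"
        # curatestate must be 'approved' or null.
        and item.get("curatestate") in ("approved", None)
        and item.get("scandate") is not None
    )
```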
3133 census items skipped.
Planning to download Meta XML, MARC XML and MARC binary for all 1566659 scanned text items.
First building list of machines where items live. Total run time looks to be 4.5 hours.
Of these 903,127 have 'MARC Binary' in the format field.
1,010,205 records have MARC in the format field.
2010-04-16 Friday
Search for OLID
Redirects to correct page. Done
Scanned books without authors
Scanned books without authors don't have works, so they don't make it into the search engine.
This book has an author on archive.org, but not on Open Library:
Guidance manual for landfill sites receiving municipal waste
The scan was added to an existing book record that didn't have an author.
Need to try adding author data to scanned books if it is available. Need to find some numbers.
Confused work finder
Contains Emma, Pride and prejudice, Sense and sensibility and Mansfield Park because of books like this:
title: The novels work title: Pride and Prejudice
title: The novels. work title: Mansfield Park
title: The novels. work title: Mansfield Park
title: The novels. work title: Sense and sensibility
title: The novels. work title: Emma
title: The novels. work title: Sense and sensibility
title: The novels work title: Sense and sensibility
title: The novels. work title: Emma
title: The novels. work title: Pride and prejudice
title: The novels. work title: Persuasion
title: The novels. work title: Pride and prejudice
title: The novels. work title: Northanger Abbey
2010-04-15 Thursday
Auto-merge authors
Extracted list of authors from data dump. Now trying to build a file of authors grouped by birth and death dates.
Found 21640 (birth, death) pairs with possible matching authors and 261394 authors.
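Grouping by dates could be sketched as below. The `birth_date` and `death_date` field names and the dict-per-author shape are assumptions about the dump format, and `group_by_dates` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_dates(authors):
    """Group authors sharing a (birth, death) pair; keep groups of 2 or more."""
    groups = defaultdict(list)
    for author in authors:
        birth = author.get("birth_date")
        death = author.get("death_date")
        if birth and death:
            groups[(birth, death)].append(author)
    # Only pairs shared by several authors are merge candidates.
    return {dates: found for dates, found in groups.items() if len(found) > 1}
```

Matching dates only give candidates. Names still need comparing before any merge.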
Scanned books in work search
Extracted list of authors touched by ImportBot since 2010-03-01, total is 106934. Includes some non-scanned books. Running work finder for each author.
Author merge
Live, but not complete, still need to debug solr updates.
Example book
- The future is our assignment (1959)
- Twas the night before Christmas; a visit from St. Nicholas (1912)
2010-04-14 Wednesday
Amazon
Downloaded sitemap_dp_*.xml.gz sitemaps. 2504 files, total size is 1.6G
40,000 URLs per sitemap file.
Generating a list of ISBN from database dump, to compare with Amazon sitemap data.
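A rough sketch of pulling ISBNs out of one gzipped sitemap file for that comparison. The sitemap namespace is the standard sitemaps.org one; the idea that product URLs contain "/dp/" followed by a 10-character ISBN is an assumption based on the sitemap_dp_* file names, and `isbns_from_sitemap` is a hypothetical helper.

```python
import gzip
import re
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
# Assumed URL shape: .../dp/<10-character ISBN>; other product IDs are skipped.
ISBN10_IN_URL = re.compile(r"/dp/(\d{9}[\dX])(?:[/?]|$)")


def isbns_from_sitemap(fileobj):
    """Return the set of ISBN-10-shaped IDs found in a gzipped sitemap file."""
    with gzip.GzipFile(fileobj=fileobj) as f:
        tree = ET.parse(f)
    found = set()
    for loc in tree.iter(SITEMAP_NS + "loc"):
        m = ISBN10_IN_URL.search(loc.text or "")
        if m:
            found.add(m.group(1))
    return found
```

With the database ISBNs loaded into a set, the books missing from Open Library would be the set difference between the sitemap ISBNs and the database ISBNs.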
Spell suggest
Bogus answers from spell suggest if search terms aren't in lowercase.
ImportBot removes work link
Not sure why this happened, I can't reproduce it.
Scanned books without works
Here is a list: ia331504:/2/edward/solr/2010-04-07/missing_work
Made a list of editions that once had a work, but no longer do, 253 of them. Most are edits by the ImportBot in December.
Rebuilding list of missing_works to include previous versions, sorted by version (done).
There are 203,701 scanned editions without works attached.
Looking at how many editions have a title that the work finder skips.
Scanned books that are skipped by the work finder by title:
24 Correspondence
59 Letters
472 Publications
556 Report
69 Plays
25 Calendar
676 Works
315 Sermons
662 Bulletin
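Counts like the ones above could be produced with something like this. The set of titles is copied from the list above; the normalisation (ignore case, surrounding spaces and a trailing full stop) is an assumption, not necessarily what the work finder does, and `count_skipped` is a hypothetical helper.

```python
from collections import Counter

# Generic titles taken from the list above.
SKIPPED_TITLES = {
    "correspondence", "letters", "publications", "report", "plays",
    "calendar", "works", "sermons", "bulletin",
}


def count_skipped(titles):
    """Count edition titles that match one of the generic skipped titles."""
    counts = Counter()
    for title in titles:
        # Assumed normalisation: trim, drop a trailing full stop, lowercase.
        norm = title.strip().rstrip(".").lower()
        if norm in SKIPPED_TITLES:
            counts[norm] += 1
    return counts
```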
no author: 125653
one author: 41861
multiple authors: 33305
Merge authors
Should find some way to auto merge. For example Sándor Ferenczi.
Search pages
Need to add paging to subject and author search pages. Should probably tokenize date fields so I can search for the term 'century' and the year within a full date.
2010-04-13 Tuesday
Amazon
Trying to download Amazon best seller pages by crawling the subject facets in the search engine. Turns out I was only grabbing top level facets, need to go deeper.
Use sitemap instead of search.
Missing works for scanned books
I'm missing works for some scanned books because they don't have an author. Need to add works without an author.
show-marc
Should add link to generate MARC XML for imported MARC records and include nicer display of MARC records for IA MARC records.
Two sources have identical MARC records, the IA record was retrieved from OL.
merge authors
pass action="merge-authors" to save_many