Building Better Search

Have you ever searched for ‘Buddhism’ only to find records which contain exactly ‘Buddhism’ and not ‘Buddhist’, ‘Buddha’, or related concepts like ‘Maitreya’ or ‘Bodhisattva’? Frustrating, no?

Problem

Like many museums’ online collections, the Penn Museum’s first online collection site (launched in January) worked like the example above: it matched the terms a user searched for against the terms used in a catalog record. This type of search works quite well when either:

All records are fully described, using the same terms that a user is likely to search for (e.g. using both ‘Buddhism’ and ‘Buddhist’)
Users know how the collection is cataloged and can align their searches to accommodate our terminology (e.g. knowing to search for ‘Maitreya’ and not ‘Budai’)

Unfortunately, these conditions are almost never true. Some of our catalog records are very complete, with vivid, detailed descriptions, but much of the collection is minimally cataloged. Nor can users be expected to know offhand how our 330,000 object records have been described over the last 125 years. Over the last six months we have worked to exploit existing data to improve our online search without re-cataloging all 330,000 records to meet those two conditions.

Background

Since the 1980s a core component of the Penn Museum’s various collections management systems has been a hierarchical controlled vocabulary. In Questor Systems’ Argus it was called the Lexicon; in KE Software’s EMu it is called the Thesaurus. In both cases, it is a set of terms organized into a hierarchy that facilitates object cataloging and searching within the collections management system. The Penn Museum’s thesaurus contains approximately 67,000 terms and controls the data entered in fields like Object Name, Provenience, Material, Culture, Technique, Maker, Culture Area, Subject, and Function. The content and structure of the thesaurus allow curators, collections managers and museum staff to catalog an object with a Provenience of ‘Cincinnati’ and then find that object by searching for ‘United States’ or ‘Ohio’ or ‘Porkopolis’, because of the hierarchical relationship between the terms.
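To make the idea concrete, here is a minimal sketch of how a hierarchy like this lets one cataloged term answer several searches. The two dictionaries are hypothetical stand-ins for the thesaurus, not its actual structure in EMu, and treating ‘Porkopolis’ as an alternate name for ‘Cincinnati’ is an assumption made for illustration.

```python
# Hypothetical miniature thesaurus (not the real EMu data model).
# BROADER maps a term to its parent term; None marks the top of the hierarchy.
BROADER = {
    "Cincinnati": "Ohio",
    "Ohio": "United States",
    "United States": None,
}

# Assumption: nicknames are stored as alternates pointing at a preferred term.
ALTERNATE_OF = {"Porkopolis": "Cincinnati"}


def search_terms_matching(term):
    """Return every term a search could use to reach a record cataloged as `term`."""
    matches = [term]
    # Alternate names for the cataloged term itself.
    matches += sorted(alt for alt, pref in ALTERNATE_OF.items() if pref == term)
    # Walk up the hierarchy, collecting each broader term.
    parent = BROADER.get(term)
    while parent is not None:
        matches.append(parent)
        parent = BROADER.get(parent)
    return matches


print(search_terms_matching("Cincinnati"))
```

The cataloger enters only ‘Cincinnati’; the broader terms and the alternate come from the vocabulary rather than from the object record.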

How terms are organized in the thesaurus

Over the last twenty years, this structure has become deeply ingrained in how museum staff catalog objects. Staff use discipline-specific terms and enter only limited object-level cataloging (why enter ‘United States, Ohio, Cincinnati’ at the object level when you can enter ‘Cincinnati’ and let the thesaurus work for you?). Both habits presented huge barriers for online discovery, because traditional online discovery requires that all metadata exist at the item level. We quickly discovered that users were unable to find objects they knew we had because their queries didn’t match the object-level metadata. However, we found that many of their search terms did exist in the thesaurus.

What if we use the content and structure of the thesaurus to improve the quality of our search engine?

After experimenting with Apache Solr, we found that it is possible to use the thesaurus and Solr to replicate the functionality of the collections management system in online searches (searching for “United States” will now find objects that are cataloged as “Cincinnati”).

How

One of my favorite things about EMu is that it comes with a set of APIs. Using the API we are able to export the thesaurus content and structure into Solr and then create two text files that are used by the Solr SynonymFilterFactory to index catalog records and expand searches.

The first text file (index.txt) is used to analyze and index catalog records. Each row in this text file contains a term from the thesaurus and its primary key in the thesaurus table.

Qing Dynasty=>68250

If a catalog record contains the term ‘Qing Dynasty’, Solr associates the value ‘68250’ with the record in addition to the text value ‘Qing Dynasty’.
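As a rough sketch of how rows like this could be produced from exported thesaurus data: the list of (key, term) pairs below is a hypothetical stand-in for the export, and the function name is made up for illustration. Only the `term=>key` row format comes from the example above.

```python
# Hypothetical export: (primary key, term) pairs from the thesaurus table.
THESAURUS_ROWS = [
    (68250, "Qing Dynasty"),
    (68528, "Qin Dynasty"),
]


def build_index_rules(rows):
    """Turn (key, term) pairs into 'term=>key' lines for index.txt."""
    return [f"{term}=>{key}" for key, term in rows]


for rule in build_index_rules(THESAURUS_ROWS):
    print(rule)
```

The resulting lines would then be written to index.txt, one rule per row.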

The second text file (query.txt) is used by Solr to expand searches. Each row in this file contains a term (Qing Dynasty), any alternate spellings (Ch’ing Dynasty, 大清), the broader term (Chinese Dynasty) and the primary key for the term (68250).

Qing Dynasty,Ch’ing Dynasty,大清, Chinese Dynasty=>68250
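A row like this could be assembled along the following lines. This is only a sketch: `query_rule` and its parameters are hypothetical names, and the sketch assumes each term has at most one broader term, as in the example row.

```python
def query_rule(key, name, alternates=(), broader=None):
    """Build one query.txt row: term, alternate spellings, broader term => key."""
    left = [name, *alternates]
    if broader:
        # The broader term goes on the left so a search for it reaches this key.
        left.append(broader)
    return ",".join(left) + "=>" + str(key)


print(query_rule(68250, "Qing Dynasty", ["Ch’ing Dynasty", "大清"], "Chinese Dynasty"))
print(query_rule(68528, "Qin Dynasty", broader="Chinese Dynasty"))
```

Because the broader term appears in the row of every narrower term, a search on the broader term can pick up the keys of all of its children.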

When someone searches the online collection for “Chinese Dynasty”, their search term is checked against the rules in query.txt. Each time Solr finds ‘Chinese Dynasty’ on the left side of the => operator, it uses the value on the right side of the => as a search term. So if the query.txt file looked like this:

Qing Dynasty,Ch’ing Dynasty,大清, Chinese Dynasty=>68250
Qin Dynasty,Chinese Dynasty=>68528
Shang Dynasty,Chinese Dynasty=>68524
Han Dynasty,汉朝, Han Ch’ao,Chinese Dynasty=>68503

Then when a user searches for “Chinese Dynasty” in the Period field, Solr expands the query to search for the keys 68250, 68528, 68524 and 68503, and it will return all records that use the term “Chinese Dynasty” or any of its narrower terms in the Period field.
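The expansion can be imitated outside Solr with a few lines of Python. This is a simplified simulation of the behavior described above, not Solr's implementation: the case-insensitive matching is an assumption, and `load_rules` and `expand` are hypothetical names.

```python
QUERY_RULES = """\
Qing Dynasty,Ch’ing Dynasty,大清, Chinese Dynasty=>68250
Qin Dynasty,Chinese Dynasty=>68528
Shang Dynasty,Chinese Dynasty=>68524
Han Dynasty,汉朝, Han Ch’ao,Chinese Dynasty=>68503
"""


def load_rules(text):
    """Map each left-hand term to the list of keys it expands to."""
    expansions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        left, right = line.split("=>", 1)
        for term in left.split(","):
            # A term that appears in several rows accumulates every key.
            expansions.setdefault(term.strip().lower(), []).append(right.strip())
    return expansions


def expand(term, expansions):
    """Return the keys for a search term, or the term itself if no rule matches."""
    return expansions.get(term.strip().lower(), [term])


RULES = load_rules(QUERY_RULES)
print(expand("Chinese Dynasty", RULES))  # ['68250', '68528', '68524', '68503']
print(expand("Qin Dynasty", RULES))      # ['68528']
```

A search for the broader term collects one key from each row, while a search for a single dynasty yields only that dynasty's key.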

Results

Whether this kind of search is useful to the public, or whether users even recognize that it is happening, is still an open question, but it was worth trying. Our staff have certainly found it quite useful, since they run these types of queries all the time within EMu. This is a work in progress and there are improvements that we are planning to make, but this was a large step toward programmatically improving resource discovery without re-cataloging the entire collection.

Sample searches

“West Asia” “Armament T&E”

“Greek god” vessel - Note that none of the records contain the terms ‘Vessel’ or ‘Greek god’ but do contain representations of both

“Behavioral Control Device” - See Chenhall’s Nomenclature

“North America” basket -woven - Baskets from “North America” that are NOT woven (the minus (-) is the NOT operator and can be used to exclude items from your query)
