AMIA-MIG Archive Directory Working Group

TASK FORCE D:
Collection Size
Formats Collected
Organization Location
OrganizationID

Charge

Write the Collection Size, Formats Collected, Organization Location, and OrganizationID sections of the Moving Image Gateway Archive Directory Report, the working document MIG developers will use to design the Archive Directory database: finalize options for defining collection size, including controlled vocabularies with scope notes. Finalize controlled vocabulary with scope notes for the Formats Collected data element. Identify modes of geographic access (city, state/province, country, metropolitan area, non-jurisdictional geographic area) and corresponding controlled vocabularies with scope notes if applicable. Describe specifically how NUC codes can be used for Organization ID and Organization Location data elements, if that is your recommendation. Resolve any related issues that were raised at the July 25-26 meeting (as described in minutes and breakout session reports/discussion), documenting the rationale for your decisions. As you work, note any issues that you think should be addressed in guidelines for participant (organization) input and guidelines for end users and submit those with your report.

Build on the work done at the July 25-26 Archive Directory Working Group meeting; do not start from scratch. Attached is a list of who discussed what in the breakout sessions; you are encouraged to contact members of those groups to discuss issues and/or clarify points. Utilize resources identified below and consult with outside experts as you see fit. If you encounter an issue that you feel needs broader discussion, submit it to the Archive Directory Working Group listserv.

Roster

Randy Barry, Chair
Karen Cariani
Gary Carter
Jim Hubbard
Mairéad Martin

Related documents

Minutes, Moving Image Gateway Archive Directory Working Group Meeting
MARC Code List for Geographic Areas [hierarchical, to state/province level only]
http://lcweb.loc.gov/marc/geoareas/
National Authority File (Library of Congress Authorities)
http://authorities.loc.gov/
Getty Thesaurus of Geographic Names [hierarchical] http://www.getty.edu/research/tools/vocabulary/tgn
Statistical Abstract of the United States, Appendix 2 (for U.S. metropolitan areas):
http://www.census.gov/prod/2002pubs/01statab/app2.pdf
Cornucopia "Structure" document (http://www.cornucopia.org.uk/tech.html)
Breakout session rosters ("Who discussed what")
Questionnaire created for Footage: the Worldwide Moving Image Sourcebook
Other directories outside the moving image area that might have pertinent vocabularies
Rules for Archival Description (for Formats Collected)
SMPTE (for Formats Collected)
Archival Moving Image Materials: a Cataloging Manual, 2nd ed. (for Formats Collected)
SMPTE technical glossary? (for Formats Collected)
European Broadcasting Union (EBU) (for Formats Collected)
Vidipax publication on video formats (for Formats Collected)
FIAT guide to AV archives (for Formats Collected)
FIAF guide? (for Formats Collected)
International Association of Sound Archives? (for Formats Collected)


August 29, 2002
Moving Image Gateway Archive Directory Project
Contribution for the Database Design of the
Collection Size, Formats Collected, Location, and Organization ID Elements

Collection Size:

This section deals with the Moving Image Gateway Archive Directory element that identifies collection size. The main issue is the variety of ways the size of a collection can be measured. If the staff of an organization have measured or counted its materials one way, they probably wouldn't be inclined to measure or recount the collection using a different unit of measure in order to put the information into the Archive Directory. If the Directory wants to encourage participation and capture as much information as possible from a wide range of archives, then it will have to make this field flexible enough to account for different units of measure.

A corollary issue is the difference between the systems of measurement in the U.S. and other countries, especially in Europe. Should both the metric and U.S. (imperial) measuring systems be offered? There was no consensus during the July meeting of the MIG advisory group.

Several of the breakout groups felt that a simple choice between small, medium, and large (S-M-L) for identifying collection size was too subjective. During group discussion, it was noted that a designation of S-M-L could be determined systematically, after data gathering, based on some criteria applied to mandatory data submitted by the archives. The group recommended capturing a number of titles and/or shots in a single element. If this was not relevant, then an archive could submit some other kind of information describing the size of the collection. In general, all groups thought that if the number of units was offered as a data input/capture option, it should include:

Numeric values:

· number of discrete titles
· number of physical items (e.g., reels, cans, tapes)
· number of hours of footage
· length of film (in meters or feet)
· length of shelving occupied (in linear meters or feet)
· cubic measurement of storage space occupied (in cubic meters or feet)
· number of shots

The number of collections was discussed and not thought to be useful.

Using S-M-L as a collection size descriptor was not considered by all to be helpful for searching. Collection content is generally more important for searching. One possibility could be to allow the identification of a generic first-level descriptor of S-M-L and then allow unit size descriptors that were more specific. At least one should be mandatory to get some information about the size of the collection. The first-level S-M-L was generally preferred as the mandatory element. A suggestion on how S-M-L might be defined is given below:

Generic size identifier:

1. Small: less than 5,000 units
2. Medium: greater than 5,000 but less than 100,000 units
3. Large: greater than 100,000 units
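
If the generic size identifier were derived systematically from a submitted unit count, as was suggested in discussion, the rule might look like the following minimal sketch. The function name is hypothetical. The thresholds above leave counts of exactly 5,000 and exactly 100,000 unassigned; the sketch places both in Medium, which is an assumption and not a decision of the group.

```python
def size_category(units: int) -> str:
    """Map a unit count to the generic Small/Medium/Large identifier."""
    if units < 0:
        raise ValueError("unit count cannot be negative")
    if units < 5_000:
        return "Small"
    # Assumption: the two boundary values (5,000 and 100,000) count as Medium.
    if units <= 100_000:
        return "Medium"
    return "Large"
```

Because different archives count different units (titles, reels, hours), the same thresholds may not be meaningful for every unit; that question is left open here.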

Other suggestions arose during the discussion. One was that collection size be split into two elements, one with a controlled vocabulary, the other allowing free text. It was noted that for technical reasons, there should be limits on the size of entries. If data are to be used dynamically to generate web pages, the quantity of this data needs to be manageable. Another possibility is a list of unit qualifiers, which could be associated with a numeric value. This would allow archives to choose the unit and number of those units in a collection.

Finally, it was suggested that it might be useful to include a checkbox to indicate if a collection is static or growing. This could be expanded to indicate the rate of acquisition or collection growth. Some felt that whenever lists of units or choices are offered, "Unknown" and "Other" should be among the choices. (Note: this would not be particularly useful for searching, but could be in the Web-generated displays.) It was suggested that some elements could be made repeatable, depending upon the design of the database, and providers of information about an archive would be encouraged to select all that apply, in cases of diverse collections.
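
The suggestions in this section (unit qualifiers paired with a numeric value, repeatable statements, a length-limited free-text element, and a static/growing indicator) could be combined in one record. The following is only a sketch for the database developers to react to. All class and field names, the metric/U.S. flag, and the 250-character limit are hypothetical; the unit qualifiers are taken from the list of numeric values given earlier in this section.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Unit qualifiers from the list of numeric values discussed in this section,
# plus the "unknown" and "other" choices some participants asked for.
UNIT_QUALIFIERS = (
    "titles", "physical items", "hours", "film length", "shelf length",
    "storage volume", "shots", "unknown", "other",
)
MEASURES = ("metric", "us", None)  # hypothetical flag for length/volume units
MAX_NOTE_CHARS = 250               # hypothetical cap on the free-text element


@dataclass
class SizeStatement:
    unit: str
    value: Optional[float] = None  # None when no count is available
    measure: Optional[str] = None  # "metric" or "us", where it applies

    def __post_init__(self):
        if self.unit not in UNIT_QUALIFIERS:
            raise ValueError(f"unit qualifier not in vocabulary: {self.unit!r}")
        if self.measure not in MEASURES:
            raise ValueError(f"unknown measuring system: {self.measure!r}")
        if self.value is not None and self.value < 0:
            raise ValueError("value cannot be negative")


@dataclass
class CollectionSize:
    statements: List[SizeStatement] = field(default_factory=list)  # repeatable
    growing: Optional[bool] = None  # True = growing, False = static, None = not stated
    note: str = ""                  # free-text description

    def __post_init__(self):
        if len(self.note) > MAX_NOTE_CHARS:
            raise ValueError("free-text note exceeds the length limit")
```

An archive with a diverse collection could then supply several statements, for example CollectionSize([SizeStatement("titles", 1200), SizeStatement("film length", 90000, "metric")], growing=True).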

+++++++++++++++++++++++++++++++++++++++

Formats Collected:

This section deals with the Moving Image Gateway Archive Directory element that identifies the formats of the items in a moving image archive collection. This will be one of the most technically specific elements in the MIG AD database and an important one, since the use of moving image collections is usually dependent upon the format of the materials.

Formats collected would have to identify one of a growing number of standards that are or have been in use. Many of the formats are well-known, but some are obscure. The obscurity of some of the formats suggests that definitions may be useful for some or all. Options for formats could be provided in the form of pull-down lists.

The identification of formats collected would consist of a generic, high-level identifier. For conveying specific information, two schemes immediately present themselves. Either archives could be encouraged to list the most important formats collected in the description of their collections, or a second element could be provided with a specific, though not exhaustive, list of formats.

The high-level identifier would offer the following 5 choices:

o Film (including audio recordings on film, e.g. mag tracks, optical tracks)
o Videotape (analog/digital)
o Audio recordings (all audio not on film or digital file)
o Video discs (CD/DVD/Laserdiscs)
o Digital files (sound or picture; include DLT)

The list of formats for the second element would include the following:

Video formats:
1/2 in. open reel
8 mm. open reel
1 in. open reel
2 in. open reel
Betacam
Betacam SP
CD
D1
D2
D3
D5
DCT (Ampex Digital Component Tape)
Digital Betacam
DV
DVCPro
DVD
MiniDV
S-VHS
U-Matic (3/4 in.)
VHS

Film Formats:
8 mm. film
9.5 mm. film
16 mm. film
super 16 mm. film
35 mm. film
Super 35 (Superscope 235?)
IMAX formats (15/70 and 8/70)
70 mm. film

The above list is subject to further refinement to meet the needs of a larger number of constituent archives. In addition, it may be necessary to give archives the option of identifying "other film gauges," "other video formats," and perhaps "all formats," for archives that collect many formats.
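
One way the two elements could work together is for each specific format to carry its high-level identifier, so that an archive selecting specific formats from a pull-down list has its high-level choices filled in automatically. The sketch below illustrates this with an excerpt of the lists above. The function name is hypothetical, and the assignment of each specific format to a high-level identifier is an assumption that would need review.

```python
# The five high-level identifiers, in the order given in this section.
HIGH_LEVEL = ("Film", "Videotape", "Audio recordings", "Video discs", "Digital files")

# Excerpt only. The category assigned to each specific format is assumed.
SPECIFIC_FORMATS = {
    "Betacam SP": "Videotape",
    "Digital Betacam": "Videotape",
    "U-Matic (3/4 in.)": "Videotape",
    "VHS": "Videotape",
    "DVD": "Video discs",
    "16 mm. film": "Film",
    "35 mm. film": "Film",
}


def high_level_for(selected):
    """Return the high-level identifiers implied by the selected specific formats."""
    unknown = [f for f in selected if f not in SPECIFIC_FORMATS]
    if unknown:
        raise ValueError(f"not in the controlled list: {unknown}")
    implied = {SPECIFIC_FORMATS[f] for f in selected}
    # Keep the order of the high-level list for consistent display.
    return [h for h in HIGH_LEVEL if h in implied]
```

Under these assumptions, an archive selecting VHS, 16 mm. film, and DVD would be assigned Film, Videotape, and Video discs.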

It was suggested that there needs to be a category for high-definition-friendly materials as well, whether they be camera original items (Super 16mm, e.g.) or transfers made with HD repurposing in mind (16:9 anamorphic transfers of Super 16 materials, e.g.). It is unclear how digitized files (of footage born digital or produced from analog materials) would be characterized in terms of format. Perhaps these would fall into the "other" or "not applicable" category, since they could be output on various formats if needed for viewing.

+++++++++++++++++++++++++++++++++++++++

Location:

This section deals with the Moving Image Gateway Archive Directory element that identifies the location of the archive. At its July 2002 meeting, the MIG Archive Directory advisory group reached consensus that location information should consist of three levels: 1) city or town, 2) region (state, province, or other intermediate jurisdiction), and 3) national level location. Although for the U.S. these levels of jurisdiction are fairly easy to identify, for other countries, the situation is not as simple. Some countries have more than one level of regions (e.g., the UK has large regions, such as Scotland, in addition to a system of shires within the larger regions).

1) National level:

Consensus for what constitutes a nation is codified in ISO 3166-1, an International Standard which provides a list of 239 countries by name and code. ISO 3166-1 country codes are commonly used as the last portion of many email addresses, Web site host domains, and other applications where country codes are needed. The MIG database could make use of ISO 3166-1 country codes and/or names for storage and access of national level location information in the database. For input/capture of data, pull-down menus could be designed that offer choices of ISO 3166-1 countries. What is actually stored in the database could be left up to the database designers, as long as either code or full name could be served up in search result sets. Lists of ISO 3166-1 codes and names could be easily built into the MIG database interface. Although the list of country names and codes is not appended here, it can be provided to database developers at a later time.

2) Regional level (State, province, or other intermediate jurisdiction):

Consensus for identifying jurisdictions below the national level is embodied in ISO 3166-2 (Region names). Not all jurisdictions assigned country-level identifiers in ISO 3166-1 have corresponding region names in ISO 3166-2, however. For example, although the island of Puerto Rico is assigned a country-level code in ISO 3166-1, no regions have been identified for Puerto Rico in ISO 3166-2. This is consistent with the other state-level regions in the U.S. for which no other jurisdictions (for example, counties) are provided in ISO 3166-2.

Most countries assigned codes in ISO 3166-1 have established lists of regions in ISO 3166-2. It is unlikely that those that do not have regions identified would fall into a category of countries with so many moving image archives that this would be a problem. The MIG database could make use of ISO 3166-2 region codes and/or related names for storage and access of regional level location information in the database. For input/capture of data, pull-down menus could be designed that offer choices of ISO 3166-2 regions. This would get around the problem of people sometimes providing abbreviations for parts of an address. The interface could enforce the form of the region name. What is actually stored in the database could be left up to the database designers, as long as either code, abbreviation, or full name could be served up in search result sets. Lists of ISO 3166-2 codes and names could be easily built into the MIG database interface. They are not appended here but are available on the Web on a Canadian site (http://geotags.com/iso3166/) and can be provided to database developers at a later time.
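
As an illustration of storing codes while serving up either code or name, the following sketch uses a three-entry excerpt of each list. The function names are hypothetical, the names shown are illustrative, and the real tables would be loaded from the published ISO 3166-1 and ISO 3166-2 lists.

```python
# Illustrative excerpts only; the full lists would come from the standards.
COUNTRIES = {"CA": "Canada", "NL": "Netherlands", "US": "United States"}
REGIONS = {
    "CA-ON": "Ontario",
    "NL-ZH": "Zuid-Holland",
    "US-CA": "California",
}


def regions_for(country_code):
    """Return the (code, name) pairs a pull-down menu would offer for a country."""
    prefix = country_code + "-"
    return sorted((c, n) for c, n in REGIONS.items() if c.startswith(prefix))


def display(code, as_name=True):
    """Serve up either the stored code or its full name."""
    table = REGIONS if "-" in code else COUNTRIES
    if code not in table:
        raise KeyError(f"code not in list: {code!r}")
    return table[code] if as_name else code
```

A country with no regions in the list simply yields an empty menu, in which case the regional level would be left blank for that archive.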

3) City or town level (populated places):

Of perhaps greatest importance, yet more difficult to standardize and codify, is the identification of the populated places, usually cities or towns. The MIG Archive Directory will definitely need to include the place in which a moving image archive is located. Doing so in a standardized way is important. Various efforts to identify and name terrestrial populated places have been made. The Getty Thesaurus of Geographic Names (TGN) has been suggested as a possible source of the names of jurisdictions at this level. TGN provides names, geographical coordinates, references, and other important information for all sorts of geographic entities (e.g., continents, countries, regions, populated places). It does not provide codes for the jurisdictions it lists, however. TGN prefers names in English and usually provides Latin script vernacular (non-English) forms of name as references. These could be useful to data input/capture and retrieval. Since TGN is a mixture of geographic names at various levels, it is unclear how this list would be integrated into the MIG data input/capture interface, or the database itself. Users would most likely expect to be able to search (or limit) by populated place. It may be necessary to work with Getty to integrate the MIG Archive Directory with TGN, particularly if the database hopes to integrate the TGN forms of name into the MIG database.

A coded representation of populated places may not be useful or possible. Although global airport codes have been suggested as a possible source of codes for populated places, only a small percentage of populated places have been assigned codes. The three-letter structure of airport codes would not provide for enough expansion to accommodate all the moving image archives likely to need codes. It has been suggested that MARC organization codes, which include a country, region, and city prefix, might provide either design features or actual codes to be of use to the MIG Archive Directory Project.

The MARC Code List for Organizations has many well-established prefixes for cities. Although the city prefix syntax has varied in the past from one to five letters, it could be fixed at two, three or four characters to support the MIG project. This may enhance retrieval of MIG database records. A design decision will need to be made as to whether storage and retrieval of populated place information is worthwhile. Algorithms could be developed to generate City codes automatically, although this runs some risks since the application of an algorithm sometimes results in combinations of letters which are obscene or otherwise unacceptable.
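
To make the idea of algorithmic generation concrete, the sketch below derives a fixed-length city code from the letters of the place name, skipping any candidate that is already in use or that appears on a list of unacceptable strings. This is not the MARC assignment practice; the function name, the default three-letter length, and the candidate ordering are all assumptions made for illustration.

```python
from itertools import combinations


def city_code(name, taken, blocked, length=3):
    """Propose a city code that is neither already taken nor blocked.

    `taken` and `blocked` are sets of lowercase strings (hypothetical inputs).
    """
    letters = [c for c in name if c.isalpha()]
    if len(letters) < length:
        raise ValueError("place name is too short for the requested length")
    # Always keep the first letter; try the remaining letters in name order,
    # so the first candidate is simply the first `length` letters of the name.
    for rest in combinations(range(1, len(letters)), length - 1):
        code = letters[0].upper() + "".join(letters[i].lower() for i in rest)
        if code.lower() not in taken and code.lower() not in blocked:
            return code
    raise ValueError("no acceptable code could be generated")
```

Under these assumptions, Berkeley would first be offered "Ber"; if that were taken, the next candidate would be "Bek". A human review step would probably still be wanted, since no list of unacceptable strings is ever complete.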

Data for populated places poses special problems for small and incorporated towns which may be associated with other, better known populated places. This is especially true of small populated places near major metropolitan areas. The solution to this problem would be the inclusion of an element in the MIG Archive Directory database for major nearby populated places, when the actual populated place in which an archive is located is not well known (for example, for Zoetermeer, a small suburb of The Hague, Netherlands, which is less likely to be known outside the Netherlands). The input/capture database could provide an option for including data about the nearest major city. The data for this subelement could be tied to the same source as for the primary populated place for the archive (i.e., Getty TGN?). Not all archive directory entries would need to include this subelement. Both primary and related populated places would be treated equally in searching and retrieval. A person searching for archives in a particular place would also retrieve nearby archives that had identified that place as their related populated place.
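
The equal treatment of primary and related populated places in searching could be sketched as follows. The record layout, the function name, and the archive names in the usage note are hypothetical; only the three location levels and the nearby-place subelement come from this report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Location:
    country: str                        # ISO 3166-1 code, e.g. "NL"
    region: Optional[str]               # ISO 3166-2 code, or None if not applicable
    place: str                          # primary populated place
    nearby_place: Optional[str] = None  # optional major nearby populated place


def archives_in(place, directory):
    """Return names of archives whose primary or nearby place matches."""
    wanted = place.casefold()
    hits = []
    for name, loc in directory.items():
        names = [loc.place]
        if loc.nearby_place:
            names.append(loc.nearby_place)
        if any(wanted == n.casefold() for n in names):
            hits.append(name)
    return hits
```

With a directory containing a hypothetical "Archive A" at Location("NL", "NL-ZH", "Zoetermeer", "The Hague") and "Archive B" at Location("NL", "NL-ZH", "The Hague"), a search for "The Hague" would return both, while a search for "Zoetermeer" would return only Archive A.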

The MARC Code List for Organizations, which is available as a database and which has established prefixes for most U.S. cities and many foreign cities, could be used as the source of city codes. Some manipulation of the MARC ORG code database would be needed to harvest this data.

+++++++++++++++++++++++++++++++++++++++

Organization ID:

This section deals with the Moving Image Gateway Archive Directory element that will contain a unique identifier assigned to each archive. No structure and syntax requirements for the identifier have been prescribed, but adherence to some standard is advisable. With the advent of the ISIL (International Standard Identifier for Libraries and Related Organizations--ISO 15511), it is recommended that any ID element designed for the MIG database conform to that International Standard.

ISILs make use of standardized prefixes, usually geographical ones from ISO 3166-1, which identify the location of an organization at a high level. This would allow the generation of MIG database IDs from location information for a moving image archive. ISO 15511 is a flexible standard which does not specify the style of the ID beyond the highest level portion. Although the standard does prescribe a syntax consisting of various geographic and onomastic portions, MIG database designers would be free to fashion IDs that meet the needs of the database itself, both in terms of data management, and automatic assignment of new IDs.

During the July 2002 meeting of the MIG Archive Directory advisory group, some suggested that NUC symbols might serve as organizational IDs in the MIG database since they are compatible with ISO 15511. (NOTE: With the demise of the National Union Catalog Program at the Library of Congress, the NUC symbol list became what is now entitled the MARC Code List for Organizations. NUC symbols are now called MARC organization codes, or simply ORG codes.) The current assignment practices for codes in the MARC ORG codes list meet the requirements of ISO 15511. Although some MARC ORG codes reflect older assignment practices, non-conformant codes can be revised when necessary to meet the needs of the MIG Archive Directory database. The MARC list currently includes unique codes for over 33,000 libraries, archives, and related organizations. MARC ORG codes have not yet been assigned to many moving image archives, but this work could be done within the current assignment infrastructure. The Library of Congress processes a large number of requests each week for new codes, most coming via online request forms which are available from the MARC Organization Code homepage on the Web. If it is decided to use MARC ORG codes as directory record IDs, they would have the following characteristics and syntax:

1) A geographic prefix consisting of the ISO 3166-1 two-character uppercase alphabetic code for the country of location of the archive;

2) A geographic portion for the region from ISO 3166-2, separated from the country prefix by a hyphen (this portion could be optional for countries for which regions are not commonly identified);

3) A geographic portion for the city or town where the archive is located. An effort would be made to make this portion of the identifier unique for IDs within the country, that is, no two cities in the same country would ever use the same string of letters or digits to identify the city. Either the existing city prefixes from the MARC Code List for Organizations could be used, or new prefixes could be generated algorithmically;

4) Organization name portion that could be derived from letters in the name (the MARC Code List for Organizations usually takes the first letter from the first three words in the name, or if this results in a conflict with the code for another organization, the first three letters from the first word in the name are taken). The organization name portion of the ID could also be selected randomly and need not be alphabetic (for example, a unique, sequential number). New archives could be allowed to choose the letters for this portion themselves. This might be advisable if the ID is to have any prominence in the MIG Archive Directory database.

In all cases, an ISIL string must be unique when compared in its entirety to other ISILs. ISILs must be unique regardless of the case of the alphabetic characters used in the ID string. MARC organization codes currently meet the above design requirements. NOTE: Codes for U.S. organizations do not currently include the prefix "US-", which could be added as part of the project to build this directory.

Example: US-CaBerPFA (ID for the Pacific Film Archive, in Berkeley, California, U.S.A.)
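
The four portions described above, and the requirement that IDs be unique regardless of case, could be handled along the following lines. This is a sketch only: the function and class names are hypothetical, and it does not attempt to check an ID against the full text of ISO 15511.

```python
def make_id(country, region, city, org):
    """Join the four portions into one ID string, e.g. US-CaBerPFA."""
    if not (len(country) == 2 and country.isalpha() and country.isupper()):
        raise ValueError("country must be a two-letter uppercase code")
    # The region portion is optional; the hyphen follows the country prefix.
    return f"{country}-{region or ''}{city}{org}"


class IdRegistry:
    """Hypothetical registry that enforces case-insensitive uniqueness."""

    def __init__(self):
        self._seen = set()

    def register(self, identifier):
        key = identifier.casefold()
        if key in self._seen:
            raise ValueError(f"ID already assigned: {identifier}")
        self._seen.add(key)
        return identifier
```

With these definitions, make_id("US", "Ca", "Ber", "PFA") produces the example ID shown above, and an attempt to register the same string in a different case (for instance "us-caberpfa") after the original has been registered would be rejected.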

The MIG database would be searchable by archive ID, for cases where a user knew the ID.