Author: Johann Van Reenen, Editor
Title: Digital Libraries and Virtual Workplaces. Important Initiatives for Latin America in the Information Age
2. Students and ETDs
Students are the most important participants in ETD activities. They are the main target of the education effort. They are the ones who learn by doing, and so promote access to the ETDs they prepare to help communicate their research results.
Benefits to students utilizing ETDs
There are many reasons for ETDs. Indeed, if one asks “What are the reasons to not have ETDs?” it is difficult to find any convincing, forward-looking answer. Almost all TDs are produced as electronic documents, and if students know in advance about how to prepare ETDs, then creating their own ETD usually is a very simple process. In addition, there are special benefits that result from ETD creation:
- New genre
The first benefit is that new, better types of TDs may emerge as ETDs develop as a genre. Rather than be bound by the limits of old-style typewriters, students may be freed to include colour diagrams and images, dynamic constructs like spreadsheets, interactive forms such as animations, and multimedia resources including audio and video. To ensure preservation of the raw data underlying their work, promote learning from their experience, and facilitate confirmation of their findings, they may enhance their ETDs by including the key datasets that they have assembled.
As the new genre of ETDs (Fox, McMillan, & Eaton 1999a) emerges from this growing community of scholars, it is likely to build upon earlier forms. Simplest are documents that can be thought of as “electronic paper” where the underlying authoring goal is to produce a paper form, perhaps with color used in diagrams and images. Slightly richer are documents that have links, as in hypertext, at least from tables of contents, tables of figures, tables of tables, and indexes – all pointing to target locations in the body of the document. To facilitate preservation, some documents may be organized in onion-fashion, with a core mostly containing text (that thus may be printable), appendices including multimedia content following international standards, and supplemental files including data and interactive or dynamic forms that may be harder to migrate as the years pass by. Programs, applets, simulations, virtual environments, and other constructs yet to be discovered may be shared by students who aim to communicate their findings using the most suitable objects and representations.
- Minimize duplication of effort
A second benefit of ETDs is a reduction in the needless repetition of investigations that are carried out because people are unaware of the findings of other students who have completed a TD. Except in unusual cases, masters’ theses are rarely reported in databases (e.g., very few, except those from Canada, appear in UMI services like Dissertation Abstracts). Few dissertations prepared outside North America are reported either. With a globally accessible collection of ETDs, students can quickly search for works related to their interest from anywhere in the world, and in most cases examine and learn from those studies without incurring any cost.
- Improve visibility
Once ETDs are collected on behalf of educational institutions, digital library technology makes it easy for works to be found. Through www.theses.org, NDLTD directly makes ETDs available, and points to other services that facilitate such discovery. As a result, hundreds or thousands of accesses per year per work are logged, for example, according to reports from the Virginia Tech library regarding the ETDs it makes publicly accessible (Fox, McMillan, & Eaton 1999a; Eaton, Fox, & McMillan 1998). As the collection of ETDs available grows and reaches critical mass, it is likely that it will be frequently consulted by the millions of researchers and graduate students interested in such detailed studies, expositions of new methodologies, reviews of the literature on specialized topics, extensive bibliographies, illustrative figures and tables, and highly expressive multimedia supplements. Thus, students and student works will become more visible, facilitating advances in scholarship and leading to increased collaboration, each made possible by electronic communication, across space and time (Fox, Hall, Kipp, Eaton, McMillan, & Mather 1997b).
- Accelerate workflow
ETDs can be managed through automated procedures honed to take advantage of modern networked information systems. Since the shift to ETDs requires policy and process discussion among campus stakeholders, it is possible to streamline workflow and save time and labor. Checking of submissions and cataloging is sped up, moving and handling of paper copies is eliminated, and delays for binding are removed. The time between submission and graduation can be reduced, and ETDs can be made available for access within days or weeks rather than months.
- Save money
ETD submission over networks costs students essentially nothing, which compares favorably with the charges of hundreds or thousands of dollars otherwise required to print, copy, or publish TDs using paper or other media forms. In many institutions, the networking, computing, and software resources available to students suffice so that students preparing ETDs need make no additional expenditure. Similarly, on many campuses, assistance is available to answer questions and train students regarding word processing and other skills valuable for authors of electronic documents and users of digital libraries. If students elect to use personal computers and acquire their own software to use in ETD creation, these will later be useful in other research and development work, for both professional and personal needs, with low marginal expense specifically required for ETDs. Thus, for students preparing ETDs, the benefits typically far outweigh the costs.
Access to ETDs
Since in most cases it is in the interest of students and universities to maximize the visibility of their research results, the general approach of NDLTD is to encourage all interested parties to facilitate access to ETDs. There are a number of well-known sites and resources for ETDs, and the NDLTD runs the Web site http://www.theses.org, which also has the alias http://www.dissertations.org, as a central clearinghouse for access to ETDs. This site points to various others that support portions of the worldwide holdings of ETDs. For example, the largest corporate archive, with over 1.5 million entries, is managed by UMI, and has most doctoral dissertations from the USA and Canada, as well as most masters’ theses from Canada, in microfilm form, with metadata available as a searchable collection through Dissertation Abstracts. Since 1997 UMI has scanned new submissions (originally from microfilm, later directly from paper) and made the page images available through PDF files. With over 100,000 ETDs accessible through subscription or direct payment mechanisms, UMI hosts the largest single collection of electronic TDs as well as of microfilm TDs.
Other corporations as well as local, regional, national, and international groups associated with NDLTD have Web sites too, such as http://www.cybertheses.org for the international Francophone project or http://www.dissonline.org. In addition, a number of WWW search engines have indexed some of the ETD collections available so this genre is included in general Web searches. Some other schemes also allow access to ETD collections. Using Z39.50, the “information retrieval protocol”, for example, the Virginia Tech ETD collection can be accessed through suitable clients or from some library catalogue systems. OCLC’s WorldCat service, with over 20 million catalogue records, has an estimated 3.5 million entries for TDs. Perhaps most promising is that the global as well as regional and local metadata information about ETDs may become widely accessible through the Open Archives Initiative (Van de Sompel 2000). This initiative is discussed in greater detail in van Reenen’s chapter on scholarly communication.
Searching ETDs across sites and in local collections
As part of the education component of NDLTD, it is hoped that graduate students will become facile with searching through electronic collections, especially those in digital libraries. If we regard managing information as a basic human need, ensuring that the next generation of scholars has such skill seems an appropriate minimal objective. Most specifically, since graduate research often builds upon prior results from other graduate researchers, it seems sensible for all ETD authors to be able to search through available ETD holdings. NDLTD encourages the provision of online resources, self-study materials, individual assistance, as well as group training activities so that graduate students become knowledgeable about resource discovery, searching, query construction, query refinement, citation services, and other processes – both for ETDs and for content in their discipline.
Classification systems and schemes
Considering further the educational mission of NDLTD, it is hoped that students will learn other concepts from the fields of library and information science. As emerging scholars, they should grasp the entire information life cycle that is now being supported through digital libraries (Borgman 1996). Some of those aspects are considered below. Here we note that manual or automatic schemes are often deployed to categorize or classify documents so they can later be found by referring to an appropriate category. Indeed, when people browse through a collection, they often navigate through a suitable classification system or “concept space” to find likely portions to examine.
There are general classification schemes, such as the Library of Congress Subject Headings, Dewey Decimal Classification, and simpler schemes prepared by UMI and UNESCO. The US National Library of Medicine has MeSH (Medical Subject Headings) as well as the more extensive UMLS (Unified Medical Language System), while for computing the Association for Computing Machinery (ACM) maintains the Computing Classification System. Many other services are offered for other disciplines.
Learning about creating ETD systems
Since education is the core of NDLTD efforts, it is important to ensure that a wide variety of mechanisms are in place, for students, with their varying learning styles, to be aided. First, learning by example is facilitated because thousands of ETDs are available that can be consulted, including many in one's own discipline, as well as exemplary or notable works such as those highlighted from http://www.theses.org. Second, participants in NDLTD typically have online training resources available, such as the Virginia Tech site at http://etd.vt.edu, where general information as well as specific local requirements are addressed. Third, most universities in NDLTD periodically offer workshops to explain ETD preparation, often tailored to both novice and expert groups. Some of these involve presentations, while others involve hands-on activities. The latter may occur in special classrooms or laboratories, sometimes with scanners and other multimedia devices, to serve specialized as well as common needs. Typically, a campus will have a small cadre of helpers who are knowledgeable about the ETD process, and can resolve unusual problems or address special needs. Though such services are seldom needed at sites where comprehensive computer and information literacy programs are in place, it is appropriate that when ETD submission becomes a mandatory requirement, those who face difficulties should be quickly aided.
Guide to preparing an ETD
Since students learn best by doing, developing their own ETD is the most effective way for the next generation of scholars to be prepared regarding electronic document production. Though details will vary over the years, this practice will ensure that students at any point in time have relevant knowledge and skills appropriate for the available technology.
Students preparing their ETDs should learn about the entire information life cycle, and work so their research results can be accessible to all interested parties, into the foreseeable future. This objective means that they must consider a variety of concepts and practices, related to document preparation and representation, as well as preservation and access, sketched briefly in the following subsections.
Writing in word processing systems
Most authors today use word processing systems. The most popular is Microsoft Word. Corel WordPerfect, in earlier years more popular, is also widely used. For those working frequently with mathematical expressions, the TeX and LaTeX family of tools (including BibTeX for bibliographies) has replaced the earlier-used UNIX suite of troff, tbl, eqn, refer, and other routines.
Office systems, developed for document preparation and high quality typesetting services, also are appropriate for long and complex works such as ETDs, when authors have requisite knowledge and skills. FrameMaker, PageMaker, StarOffice, and other packages are among the popular solutions.
Because ETDs often are complex documents, that may be developed over the years required to complete a graduate research program, it is essential that students master more than the superficial word processing skills required to produce letters and short reports. They should understand key concepts related to fonts, tables, figures, styles, and document structuring. They should be able to migrate files between versions of software, from one machine to another, to differing types of platforms, and through varying media and networks – while maintaining the message behind their content.
Since ETDs should be usable across time and space, it is imperative, however, that access to them be through suitable interchange formats, rather than transient, unpublished representations produced by particular versions of word processing systems. Accordingly, ETD initiatives have recommended widely used interchange formats like PDF, SGML, XML, and the various schemes preferred for particular types of multimedia content. As was mentioned in Section 1.3, it is preferred to have both a rendered form, like PDF, and a descriptive form, like SGML or XML. However, when that is not feasible, it is better to have one of these forms than to delay implementing an ETD initiative.
Preparing a PDF document
The most popular page representation scheme, a published de facto standard developed by Adobe, now being considered as an international standard, is the Portable Document Format, PDF. Adobe has promised to provide a Reader free of charge into the foreseeable future, which will read current as well as previous versions of PDF, so that archives of documents will remain easily usable. Adobe also provides tools for creating, annotating, and manipulating PDF documents, through its own word processing software, printer drivers, and distilling from PostScript. In addition, some public domain tools work on the published PDF format, such as ghostview™.
Adobe’s Acrobat software, installed on a Windows, Macintosh, or UNIX platform, allows most suitable documents to be converted to PDF in moments. From word processors such as Word, WordPerfect, and FrameMaker, each document portion can be “printed” to the Distiller printer driver, yielding a PDF file. The Distiller converts PostScript files to PDF files. Acrobat software allows multiple PDF files to be assembled into larger PDF files by inserting documents or deleting pages in an existing PDF file.
To avoid problems for future readers, authors should embed all fonts in their documents (when that is allowed). Otherwise, software displaying or printing PDF content will attempt to find a similar font and extrapolate from it, which may cause serious problems. Similarly, authors should use so-called “outline” fonts as opposed to bitmap fonts, so that display and printing can proceed to scale characters as required. Thus, when using TeX or LaTeX, the bitmap fonts commonly found in a standard installation should not be used. Instructions at http://etd.vt.edu, for example, explain how publicly available outline fonts can be obtained and substituted. Related problems occur when bitmap images are included in documents and scaled. Vector graphics, special outline font symbols, or object-based image tools should be used instead when possible so that rendering in PDF conveys the correct message. Most problems can be avoided by: planning in advance, following the advice of knowledgeable authors, and testing samples of all types of content that will be in the final ETD.
Preparing for conversion to SGML/XML
Converting from word processing forms to SGML or XML (Standard Generalized Markup Language and Extensible Markup Language, respectively) requires more planning in advance, different tools, and broader learning about document processing concepts than does working with PDF. In return, the end result is a representation that is easier to preserve, more reusable, and supportive of more powerful and effective schemes for searching and browsing. All of these advantages, however, must be weighed against the facts that there are fewer people knowledgeable about these matters, that often tools to help are more expensive and less mature, and that the process may be complicated, difficult, and time consuming. As of 2000, there are tens of thousands of ETDs created by scanning (mostly by UMI, but also at sites like MIT and the National Document Center in Greece), thousands converted from word processors into PDF, and hundreds in SGML or XML. These counts illustrate the relative effort required of students to prepare ETDs in each of these forms.
SGML and XML are markup languages. Both use tags, normally shown in between “<” and “>” symbols, with names or labels inside, around sections of documents that are thus “marked” or “bracketed”. Technically, structures describable this way conform to labelled bracketed grammars. This means that parts are nested within parts, just as subsections are contained within sections. The grammar or structure scheme for a type or class of documents – e.g., book, article, poem, musical score, or dictionary – is specified by a Document Type Definition (DTD). SGML requires a DTD and so is used with well-understood documents. XML, being more extensible while at the same time having stricter rules about closing tags, employs DTDs optionally.
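The nesting of parts within parts can be made concrete with a short sketch. The Python below parses a small fragment and lists each element with its depth. The tag names (thesis, front, body, section, subsection, heading) are illustrative assumptions and do not come from any particular DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical thesis fragment; tag names are illustrative, not from a real DTD.
SAMPLE = """<thesis>
  <front><title>Sample Title</title></front>
  <body>
    <section>
      <heading>Introduction</heading>
      <subsection><heading>Motivation</heading></subsection>
    </section>
  </body>
</thesis>"""


def outline(xml_text):
    """Return (depth, tag) pairs in document order, showing how parts nest."""
    root = ET.fromstring(xml_text)
    result = []

    def walk(element, depth):
        result.append((depth, element.tag))
        for child in element:
            walk(child, depth + 1)

    walk(root, 0)
    return result


if __name__ == "__main__":
    # Print an indented outline, one element per line.
    for depth, tag in outline(SAMPLE):
        print("  " * depth + tag)
```

Each subsection appears one level deeper than the section that contains it, which is the bracketed structure a DTD would describe for a whole class of documents.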
Word processing emphasizes layout or what-you-see-is-what-you-get (WYSIWYG) editing. Emphasizing what documents look like is quite distinct from focusing on the logical structure, for which markup schemes are best. Shifting from word processing representations to XML requires a different way of thinking, a different approach. The problem is harder than producing HTML by exporting from a word processor, since instead of just having a document that looks like the original it is necessary that the marked-up version itself is correctly tagged.
Some word processors have been extended to facilitate such an approach. Microsoft produced SGML Author for Word™ as an add-on package for Word 95, and new versions of WordPerfect can export content according to markup schemes. Eventually it is likely that most popular word processors will export to XML. Clearly, the resulting markup can surround document sections, headings, paragraphs, lists, figures, tables, citations, footnotes, hyperlinks, and other obvious constructs. In addition, regions with the same style can be tagged. Thus, to allow easy conversion from word processing to markup schemes requires choosing a target DTD and then consistently using document objects and styles so that there is a clear mapping from them to tags.
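The idea of a clear mapping from styles to tags can be sketched as follows. This is a minimal illustration, not a real converter: the style names, the tag names, and the assumption that a document arrives as a list of (style, text) pairs are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from word-processor style names to tags in a target DTD.
STYLE_TO_TAG = {
    "Heading 1": "heading",
    "Body Text": "para",
    "Caption": "caption",
}


def paragraphs_to_xml(paragraphs):
    """Convert (style, text) pairs into a flat XML section.

    Raises ValueError for a style with no mapping, which is how
    inconsistent styling would surface during conversion.
    """
    section = ET.Element("section")
    for style, text in paragraphs:
        if style not in STYLE_TO_TAG:
            raise ValueError(f"no tag mapping for style {style!r}")
        child = ET.SubElement(section, STYLE_TO_TAG[style])
        child.text = text
    return ET.tostring(section, encoding="unicode")
```

A heading typed as bold body text would either be tagged as an ordinary paragraph or rejected, which is why consistent use of styles matters more than appearance.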
Conversion from LaTeX is slightly simpler since the TeX approach involves using formatting commands that can be mapped to tags in XML. However, LaTeX does not require strict nesting of commands, so it may not be clear where to place end-tags. Further, LaTeX users may not consistently use the same sequences to designate changes in structure, making translation more complex. Finally, LaTeX coding of mathematical expressions is very difficult to translate to markup schemes for mathematics, like MathML.
Because of the inherent complexity of converting from word processing schemes to markup representations, it is necessary to include steps for checking and correcting converted forms. Parsers can ensure syntactic correctness, so detecting problems is often simple. To ensure semantic correctness, however, manual inspection may be required. A further test would involve rendering the marked-up document, for example to a printed or PDF form, and ensuring that the result suitably matches the output resulting from the original word processing version. In any case, human labour is likely to be needed to correct conversion errors, and presupposes that students understand enough about the process and desired output to accomplish this with facility.
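As a rough illustration of the syntactic check, the sketch below uses the XML parser in Python's standard library. It tests well-formedness only (matching and properly nested tags). Validating against a DTD would need a validating parser, which is outside this sketch.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if the text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The message includes the line and column where parsing failed.
        return False, str(err)
    return True, None
```

A converted file with a missing end-tag fails this check at once, whereas a correctly nested file with a mislabelled element passes it, which is where manual inspection comes in.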
Writing directly in SGML/XML
Since having an ETD encoded using SGML or XML is a desirable result, it also is appropriate to use special word processors or other tools developed for directly producing marked up documents. This is somewhat analogous to the process of directly producing HTML, and no doubt a broad range of tools like those available for HTML will eventually be suitable for XML authoring.
One approach, suitable for experts, is to prepare a text document using a text processing tool or editor like notepad, vi, or emacs. Then all tags must be manually entered, and document structure specified by hand. Alternatively, structure editors designed specifically for XML can be employed. Since the demand for such is smaller than for conventional word processors, currently available tools either are expensive, limited, or not very mature. Further, it is necessary that a syntax checker or parser either be built into the editor, or used in coordination with it, so that errors are quickly corrected.
Integrating multimedia elements
While most training related to word processors covers conventional text documents, perhaps along with simple drawings and inserted pictures, handling of multimedia portions of an ETD is often best managed through separate processes. Tools and special hardware exist for entering and editing complex graphics, images, sound, music, animations, video, and interactive multimedia productions. On most campuses, special laboratories or offices exist that have suitable facilities along with experts who can train seriously interested authors. However, the learning curve for such is often steep, and students should not lightly choose to include multimedia content unless it really helps them express their research results and/or will lead to skills they desire for the future.
Once produced, multimedia content should be saved in a suitable standard form. International standards like JPEG for images or MPEG for audio and video should be employed so that in future years it will be easy to understand such content. Since such conversions, however, may lead to some losses due to translation and compression, authors may wish to include both the original multimedia content as well as the standard version.
Similarly, as an aid to those interested in reading an ETD, multimedia content may be included in a number of forms. Thus, if a reader wants to view a video but only has moderate bandwidth available to download the ETD, they may be satisfied with a much smaller low-resolution version of a video. At the same time, another reader with a faster connection may prefer to view a high-resolution version. Finally, a reader with a very low bandwidth connection may want to see only a small set of images that are key frames summarizing the video.
Ultimately, multimedia content must be connected to the rest of an ETD. Usually the multimedia information is stored in separate files. These may be referred to or even linked (through hypermedia constructs) to the text or other multimedia constructs. One often appropriate scheme is to have a thumbnail image in the body of the document, which, when selected, links to a corresponding much higher resolution image, and/or video.
Providing metadata – inside and outside documents
In addition to multimedia, documents are often supplemented with metadata (i.e., data about data), typically catalogue information. Through a series of meetings that started in January 2001, a metadata specification conforming to the Dublin Core (Dublin Core Community 1999) standard and tailored to describe ETDs has been under development. The aim is that eventually every ETD will have an associated metadata description following that specification.
Such metadata can be included inside an ETD, making it a self-describing document, especially when XML is used. It is straightforward to encode Dublin Core based metadata in XML, and that can be included near the beginning or in a header portion of an XML ETD. This is similar to the practice with documents encoded in SGML according to the TEI or TEI-lite standards, developed through the Text Encoding Initiative.
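A minimal sketch of Dublin Core metadata encoded in XML might look like the following. The namespace URI is the one used for the Dublin Core element set. The wrapper element name (metadata) and all field values are assumptions made for illustration, and the sketch does not follow the ETD-specific specification mentioned above.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def build_metadata(fields):
    """Build a <metadata> element with one dc:* child per (name, value) pair."""
    metadata = ET.Element("metadata")  # wrapper name is an assumption
    for name, value in fields:
        child = ET.SubElement(metadata, f"{{{DC_NS}}}{name}")
        child.text = value
    return metadata


# Hypothetical values, for illustration only.
record = build_metadata([
    ("title", "A Hypothetical Thesis Title"),
    ("creator", "Student, Example"),
    ("type", "Electronic Thesis or Dissertation"),
    ("format", "application/pdf"),
])

if __name__ == "__main__":
    print(ET.tostring(record, encoding="unicode"))
```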
Alternatively, and clearly required for previously prepared SGML or XML documents, or documents represented in PDF, metadata can be a separate XML file that is associated or linked with a particular ETD. Varying approaches to packaging data and metadata together are possible. Note, however, that when metadata is separate, it is then possible for it to be replicated, distributed, and harvested so that ETDs can be more easily discovered without requiring that the actual ETD be examined. Indeed, to allow such processing, even when metadata is included inside an ETD, it is recommended that routines be prepared that can extract the metadata portion to allow separate use.
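An extraction routine of the kind recommended here could be sketched as below. It assumes a self-describing XML ETD whose Dublin Core elements sit in a header element. The element names etd, header, body, and metadata are hypothetical, chosen only to make the example self-contained.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)

# Hypothetical self-describing ETD; element names are illustrative.
ETD_XML = f"""<etd xmlns:dc="{DC_NS}">
  <header>
    <dc:title>A Hypothetical Thesis Title</dc:title>
    <dc:creator>Student, Example</dc:creator>
  </header>
  <body><para>Full text goes here.</para></body>
</etd>"""


def extract_metadata(etd_text):
    """Copy the dc:* elements from the header into a standalone record."""
    root = ET.fromstring(etd_text)
    record = ET.Element("metadata")
    header = root.find("header")
    if header is not None:
        for element in header:
            # Keep only elements in the Dublin Core namespace.
            if element.tag.startswith(f"{{{DC_NS}}}"):
                copy = ET.SubElement(record, element.tag)
                copy.text = element.text
    return record
```

The returned record can be serialized to its own file and distributed or harvested without shipping the full text.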
Protecting intellectual property and dealing with plagiarism
Although in most cases it is beneficial to share research results, so that others can learn from student studies and give credit to them through citations, it is necessary to provide various types of protection when desired by authors or to deal with potential abuses. Automated schemes can help, such as watermarks, digital signatures, and checksums; these are discussed further in the section below on producing ETDs. Programs to detect plagiarism also can be used to compare a new ETD with already available ETDs, ensuring that blocks of identical or similar text are not copied. Further, education, training regarding ethical and professional behavior, and suitable policies can support the guidance of faculty and university staff to promote the spirit of scholarly investigation and collaboration.
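Of the automated schemes mentioned, a checksum is the simplest to illustrate. The sketch below records a digest when a file is deposited and compares it later to detect alteration. The choice of SHA-256 is an assumption of this example, and a checksum only reveals changes; it does not prevent copying.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, recorded_digest: str) -> bool:
    """Compare current content against the digest recorded at deposit time."""
    return sha256_hex(data) == recorded_digest
```

In practice the bytes would be read from the deposited file, and the recorded digest would be stored alongside its metadata.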
To maximize portability, students should name the various parts of an ETD using the lowest-common-denominator standard for file names, typically the “8.3” form used in old systems like DOS, where a name of no more than 8 alphabetic characters is followed by a period and an alphabetic file type (e.g., pdf, jpg, mpg, txt, xml, sgm). If possible, complex directory structures should be avoided and a simple flat list used, also to ensure portability. Further, references to those names should be relative, rather than absolute, e.g., as etd.pdf rather than c:\documents\etd.pdf or /usr/student/thesis/etd.pdf.
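The naming rule can be checked mechanically. The sketch below follows the rule as stated here: at most 8 letters, a period, and an alphabetic type. The limit of 3 letters for the type is read from the "8.3" label, and requiring lowercase is an added assumption of this example.

```python
import re

# Up to 8 letters, a period, then a type of 1 to 3 letters.
# Lowercase-only is an assumption of this sketch, not a rule from the text.
PORTABLE_NAME = re.compile(r"[a-z]{1,8}\.[a-z]{1,3}")


def is_portable_name(name: str) -> bool:
    """True if the name fits the restricted 8.3 form with no directory part."""
    return PORTABLE_NAME.fullmatch(name) is not None
```

Because slashes, backslashes, and colons are not letters, absolute references such as c:\documents\etd.pdf fail the same test that accepts etd.pdf.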
Clearly, each file should have a unique name. Similarly, each ETD in a collection should have a unique and permanent identifier. Since each degree-granting institution can use a unique identifier for their archive, every ETD in the world can have a unique overall identifier made by composing the archive and ETD identifiers.
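Composing the two identifiers can be as simple as joining them with a separator. The sketch below is one hypothetical scheme: the colon separator and the sample identifiers are assumptions, not an NDLTD convention.

```python
def make_identifier(archive_id: str, etd_id: str) -> str:
    """Join an archive identifier and a local ETD identifier into one string.

    The separator may not appear in the archive part, so the composed
    identifier can always be split back into its two components.
    """
    if not archive_id or not etd_id:
        raise ValueError("both identifiers are required")
    if ":" in archive_id:
        raise ValueError("archive identifier must not contain ':'")
    return f"{archive_id}:{etd_id}"
```

As long as archive identifiers are unique worldwide and ETD identifiers are unique within each archive, the composed identifiers are unique overall.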
Submission of individual ETDs
Once a student has prepared an ETD, in most institutions involved in NDLTD, they can submit their work over the internet to a local or regional site for further processing. Following local policies, procedures, and instructions, delivered through training sessions or explained on a Web site, they will typically invoke a Web browser on the computer where their ETD resides. The workflow usually involves them entering a password or other authentication of their identity, filling in a form that provides needed metadata information, and uploading each of the files in the ETD “package”. Since they will supply their email address during this process, they can be notified, by those enforcing quality control standards in the graduate program and library, regarding any corrections or missing data they must supply, as well as when key stages in the approval process are achieved.
Becoming a researcher in the electronic age
In addition to learning about word processing, electronic document processing, and key concepts related to digital libraries, students also must gain other skills in order to be prepared to be researchers and scholars. They must be ready to meet future challenges of the electronic age, where technology continues to advance, often leading to changes in common practices that may save time or improve accuracy. Caution regarding unproven technology is sensible, but straightforward advances like increases in computing and networking speeds, or decreases in prices of experimental equipment, may be unwise to ignore. Further, innovations may lead to tools dramatically aiding their investigations. Thus, learning to deal with change is part of the wisdom that scholars must develop to survive in the complex modern world.
At the same time, scholars must remain anchored by core values such as honesty, integrity, curiosity, ingenuity, generosity, friendship, diligence, perseverance, and responsibility. They must follow the dictates of society and ethics as well as reason and truth. They should give credit as due to those who have helped them or advanced knowledge in ways related to their work.
With the aid of faculty and colleagues, following departmental and other local and discipline-specific practices, they must choose what type of access is appropriate to the various parts of their ETD, and when. Generally, the decision will be simple, allowing universal access to the entire work. If they must limit access, it is recommended that they do so for as short a time as possible and for as few parts of the ETD as is necessary, to maximize the amount and duration of access. In general, scholars are rewarded most by sharing their discoveries as widely as possible, but in today’s entrepreneurial world they may seek patent protection in order to have time to commercialize their work, if it involves one of the small number of inventions that are ready for technology transfer. If publishing is appropriate, on the other hand, they should seek to ensure that their ETD is available as well as any related prior or derivative works released in the form of articles or books. In some cases they may be required to delay access to (part of) the ETD (for some period of time). What is most important in all this, however, is that students and faculty honestly confront their responsibilities as scholars, learn key concepts related to intellectual property rights, respect laws and policies, follow contracts and agreements with sponsors and publishers, and strive to achieve balance among the many conflicting opportunities and demands they face. All in all, preparing an ETD should greatly expand the learning experience of graduate researchers, thus helping better prepare the next generation of scholars for the Information Age.