Main Documentation File for the JURIS Corpus
============================================

0. Introduction

This file describes the contents and organization of the JURIS text corpus, with regard to the
following points:

 - Organization of the corpus
 - File naming conventions
 - SGML formatting
 - Categories of documents in JURIS
 - Additional tables describing JURIS content

1. Organization of the corpus

The data contained on this two-CD-ROM set represents a release of the JURIS (Justice
Department Retrieval and Inquiry System) data collection that has been made available to the
Linguistic Data Consortium (LDC) by the U.S. Department of Justice. Each CD-ROM contains
data files in SGML format that have been compressed with the "GNU Zip" (gzip) utility, the
Free Software Foundation's compression tool, which is available for most computers and
operating systems. You can get gzip from MIT's GNU repository, ftp://prep.ai.mit.edu/pub/gnu/,
or from one of its many mirror sites. In the Microsoft Windows environment, the WinZip utility
from Nico Mak Computing Inc. can be used to uncompress the files; it can be downloaded from
http://www.winzip.com

The total uncompressed size of the data is 3206.6 megabytes (MB). Each CD-ROM contains a "doc"
directory (containing documentation and tables) and a "juris" directory (containing the
compressed text files). There are 1664 individual text files in the corpus: 1011 on the first
CD-ROM and 653 on the second.

2. File naming conventions

All text files are named according to the following pattern:

    jNNNN_II.gz

where:

    "j"   is a constant initial character
    NNNN  is a four-digit "file-set" number
    II    is a two-digit sequence number within the file-set
    "gz"  is a constant file name extension (indicating that the file has been
          compressed with "gzip")

The organization of data into "file-sets" was drawn directly from the
original form of the archive as provided to the LDC by the Department of Justice. The
motivation for this partitioning of the data has not been fully explained. The original archive
consisted of 219 files named by distinct four-digit numbers; these files ranged between less than
1 MB and nearly 70 MB in size. In order to make the data more accessible for research use, we
chose to divide the larger files into pieces, such that the average file size was about 2 MB when
uncompressed (the largest uncompressed file size is about 4.5 MB). Divisions of the files were
done at document boundaries, so all files contain whole documents. In cutting the larger files
into smaller pieces, we added the two-digit sequence number to the file names to preserve the
order of the pieces; if the original file was not too large, we left it as a single file, and added
"_00" for the two-digit sequence number in the file name.

3. SGML Formatting

The text files are
all formatted using a set of SGML tags to mark document boundaries, and to mark major
structural features within documents. As with file organization, the markup is derived from the
document structures as provided by the Justice Department. There is a functional Document
Type Definition (DTD) file for use with an SGML parsing utility (such as James Clark's "nsgmls"
program and related utilities, available from http://www.jclark.com/sp/index.htm); the DTD file
("juris.dtd") is in the "doc" directory. In summary, the following SGML structure is used in the
texts:
All tags are presented one tag per line, and actual text content is kept on separate lines. Each text
file begins with a "<FILE>" tag and ends with "</FILE>". The FILE unit contains one or more
documents, each of which is bounded by "<DOC>" and "</DOC>" tags. The other tags may occur freely within each DOC unit,
interspersed with text data. The names and locations of these tags represent a simple and direct
"re-spelling" of typographic and structural markup that was found in the original archive. As
such, there may be some variability in the usages of these tags, reflecting different practices
among the people who created the original data base. Again, as with the partitioning of the data
into file-sets, the full meaning of the markup (e.g. the values provided with the "" tags) has not
been fully explained.

Within the text content of each file, there is frequent use of the ampersand character ("&").
Since this character has special meaning for SGML parsers, it has been systematically replaced
in all files by the common SGML entity reference "&amp;" -- for example, all occurrences of
"AT&T" in the original text have been rendered in this publication as "AT&amp;T", and
similarly for all other uses of "&".

4. Categories of documents in JURIS

There are a total of
694,667 document units in the corpus, and these can be categorized (to some extent) with regard
to their content. The following is a partial list of categories and their descriptions (drawn from
one of the documents contained in the corpus):

* ADMINISTRATIVE LAW
    Published Comptroller General Decisions; Unpublished Comptroller General Decisions;
    Opinions of the Attorney General; Office of Legal Counsel (US Dept. of Justice);
    Board of Contract Appeals; ADP Protest Report (Summary of ADP Procurement Protests
    before the GSBCA); Federal Labor Relations Authority Case Decisions; FLRA
    Administrative Law Judge Decisions; Federal Service Impasses Decisions; Decisions and
    Reports on Rulings of the Assistant Sec. of Labor for Labor Management Relations;
    Federal Labor Relations Council Rulings on Requests of the Asst. Sec. of Labor for
    Labor Management Relations; HUD Administrative Law Decisions; Merit System Protection
    Board Decisions; Decisions under Immigration and Nationality Laws; Environmental
    Protection Agency General Counsel Opinions; Equal Opportunity Commission Decisions;
    Equal Employment Opportunity Commission Policy Statements; US Office of Government
    Ethics Decisions; HHS Department Appeals Board Decisions.

* DEPARTMENT OF JUSTICE BRIEFS
    Office of the Solicitor General; Civil Division; Civil Division Trial; Environmental
    and Natural Resources Division; Tax Division Criminal Appellate; US Attorney's
    Offices; US Trustees' Offices.

* CASE LAW
    U.S. Supreme Court; Federal Reporter, 2nd Series; Court of Appeals Unpublished
    Decisions; Federal Supplement; Federal Rules Decisions; Atlantic 2nd Reporter (DC
    only); Bankruptcy Reporter; Courts of Military Review; Military Justice Reporter;
    Court of Claims.

* FREEDOM OF INFORMATION ACT
    FOIA Update Newsletter; DOJ Guide to the FOIA Case List Publications.

* FEDERAL REGULATIONS
    Code of Federal Regulations; Unified Agenda of Federal Regulations; Defense
    Acquisition Regulations.

* TREATIES AND OTHER INTERNATIONAL AGREEMENTS
    United States Treaties and Other International Agreements; Department of Defense
    Unpublished International Agreements.

* INDIAN LAW
    Opinions of the Solicitor (Dept. of Interior); Ratified Treaties; Unratified
    Treaties; Presidential Proclamations; Executive Orders and Other Orders Pertaining
    to Indians.

* IMMIGRATION AND NATURALIZATION LAW
    Decisions Under Immigration and Nationality Law; Title 8 - Code of Federal
    Regulations; Immigration Reform and Control Act of 1988, Legislative History; Equal
    Access to Justice Act, Legislative History.

* STATUTORY LAW
    Public Laws; United States Code; Executive Orders; Anti-Drug Abuse Act of 1988;
    Section-by-section analysis of Anti-Drug Abuse Act of 1988; Criminal Division
    Handbook on CCCA; The Organic Laws of the United States.

* TAX LAW
    US Tax Court Decisions; US Board of Tax Appeals Decisions; Tax Division's Summons
    Enforcement Decisions; Tax Division's Tax Protester Case List; Tax Division's
    Criminal Tax Manual; Tax Division's Criminal Tax Indictment/Information Forms; Tax
    Division's Standardized Criminal Tax Jury Instructions; Tax Division's Criminal
    Section Newsletter; Tax Court Memorandum Decisions; IRS Cumulative Bulletin; Tax
    International Acts; IRS News Releases; IRS General Counsel Memoranda; IRS Actions on
    Decisions; IRS Technical Memoranda.

* MANUALS
    United States Attorney's Manual; United States Trustees' Manual; Federal Personnel
    Manual; Federal Acquisition Regulations; Federal Acquisition Circulars; Federal
    Travel Regulation; Federal Information Resources Management Regulation; Federal
    Property Management Regulations; Principles of Federal Appropriations Law; Justice
    Department Acquisition Regulation; Justice Property Management Regulations.

* DEPARTMENT OF JUSTICE WORK PRODUCTS
    Civil Division Monographs; Civil Division Torts Branch Handbook on damages under
    FTCA; Criminal Division Monographs; Criminal Division Forms; Criminal Division
    Guidelines for Drafting Indictments; Criminal Division Narcotics; Forfeiture,
    Prosecution Manual; Criminal Division Directory of Services; Asset Forfeiture
    Manuals; Obscenity Enforcement Reporter; Environmental and Natural Resources
    Division Monographs; US Sentencing Commission's Guidelines Manual; Sentencing
    Guidelines Updates.

5. Additional tables describing JURIS content

In preparing the corpus for publication, we have compiled a couple of tables to help
summarize the content of the various file-sets. Two sorts of tabulations were made: counting
documents in each file set according to category of content, and counting documents according
to dates found in the document text.

* Tabulation of document categories

The tabulation of document categories is given in "j_categ.tbl"; this table contains one line
for each file-set, with four fields on each line, separated by commas. The fields are as follows:

 - file-set number
 - range of dates found within the file set
 - abbreviated category code
 - additional commentary or document titles

The second, third and fourth fields are sometimes empty in the table, indicating that the
information was not readily extractable for the given file-set. The abbreviations of categories
are as follows:

    AL    Administrative Law
    BR    Briefs
    CL    Case Law
    EX    Executive Orders
    FOIA  Freedom of Information Act and related documents
    FR    Federal Regulations
    IA    International Agreements
    IL    Indian Law
    PL    (?)
    RE    Regulations
    SL    Statutory Law
    TAX   Tax Law

(Note that
there is not a perfect correspondence between these category codes and the descriptions of
categories in the previous section; but most of the categories are accounted for here, and
additional relations may be found in the "additional commentary" field of the table.)

* Tabulation of document dates

In hopes of providing a better sense of the time epochs covered by the various
file sets, we tabulated all occurrences of 4-digit numeric strings in the text data, whose values
ranged between 1700 and 1990. These occurrences were grouped into bins of about 20 years, on
average, and their frequency of occurrence is tabulated in the file "j_dates.tbl". This table
contains one line for each file-set, and 11 fields on each line, separated by commas and space
characters. The first field is the 4-digit file-set name, the second field shows the total number of
documents in that file set, and the remaining fields show how many documents contained an
apparent reference to a year within each of nine time ranges. The first line of the table contains
column headings, which give the time ranges for each column; the ranges are: 1700-1799,
1800-39, 1840-69, 1870-99, 1900-19, 1920-39, 1940-59, 1960-69, and 1970 forward. The last line of
the table provides column totals.
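
As an illustration of the "j_dates.tbl" layout described above, the following is a minimal Python sketch for reading one data row of that table. It assumes only what this file states: 11 fields per line, separated by commas and space characters, with the file-set name first, the document total second, and nine per-range counts after that. The function name "parse_dates_row", the range labels used as dictionary keys, and the sample row are hypothetical and were invented for this example; they are not taken from the actual table. The heading line and the totals line would need separate handling.

```python
import re

# Time ranges in column order, as listed in this documentation.
# The label "1970-" stands for "1970 forward" (our own shorthand).
RANGES = ["1700-1799", "1800-39", "1840-69", "1870-99", "1900-19",
          "1920-39", "1940-59", "1960-69", "1970-"]


def parse_dates_row(line):
    """Split one data row into (file_set, total_docs, counts_by_range)."""
    # Fields are separated by commas and/or whitespace; drop empty pieces.
    fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
    expected = 2 + len(RANGES)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    file_set = fields[0]                 # kept as text to preserve leading zeros
    total_docs = int(fields[1])
    counts = dict(zip(RANGES, (int(f) for f in fields[2:])))
    return file_set, total_docs, counts


# Hypothetical row, for illustration only (not real corpus data).
sample = "0123, 500, 0, 2, 5, 10, 20, 40, 80, 120, 450"
file_set, total_docs, counts = parse_dates_row(sample)
```

The file-set name is kept as a string rather than converted to an integer, so that it can still be matched against the NNNN part of the text file names.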