The Oslo Corpus of Bosnian Texts: a corpus of approximately 1.6 million
words, encoded with the IMS Corpus Workbench developed at the Institut
für Maschinelle Sprachverarbeitung at the University of Stuttgart.
Croatian National Corpus is a
systematized collection of selected texts mainly written in
contemporary Croatian covering different media, genres, styles, fields
and topics.
The
Alpino
Treebank.
The treebank contains syntactically annotated Dutch sentences. The
treebank
(more than 150,000 words) includes the full cdbl (newspaper) part of
the
Eindhoven corpus.
English
British National Corpus
BNC is a
one hundred million word corpus of British English, both spoken and
written.
BNC Baby is a
collection of corpora and software designed to demonstrate the full
potential of corpus linguistics in the teaching of English language and
literature.
VIEW: Variation In English Words and
Phrases This website allows you to quickly and easily search for
a wide range of words and phrases of English in the 100 million word
British National Corpus.
English Section of the Helsinki Corpus of English Texts, annotated to
facilitate searches on lexical items and syntactic structure.
The Brown University
Corpus: Approximately 1,000,000 words of American written English
dating from 1960. The genre categories are parallel to those of the LOB corpus.
CobuildDirect
service: It is an on-line service for accessing a corpus of modern
English language text, written and spoken. You may take out a six-month
or full-year subscription to CobuildDirect.
Collins
Cobuild:
If
you're interested in the English language -- especially if you are a
teacher
or a learner of English -- then these Web pages are for you. The team
here
at Cobuild works with a huge "corpus" of modern English text on
computer
to analyse language usage.
Corpus of Spoken,
Professional
American-English: The corpus, which has been constructed from a
selection
of existing transcripts of interactions in professional settings,
contains
two main sub-corpora of a million words each. One sub-corpus consists
mainly
of academic discussions such as faculty council meetings and committee
meetings related to testing. The second sub-corpus contains transcripts
of White House press conferences, which are almost exclusively
question-and-answer
sessions. ($49)
The Helsinki Corpus
(Diachronic Part): samples from texts
covering the
Old, Middle, and Early Modern English periods. 1,500,000 words in total.
The
Lancaster/Oslo-Bergen Corpus (LOB): Approximately 1,000,000
words
of
British written English dating from 1960. The corpus is made up of 15
different
genre categories. Available as orthographic text and tagged with the
CLAWS1
part-of-speech tagging system. The Leeds-Lancaster Treebank and
Lancaster
Parsed Corpus are analyzed subsamples of the LOB corpus. (See also
SIGLEX
Treebanks.)
(ftp://ftp.cogsci.ed.ac.uk/pub/corpus-LLC/)
The directory contains the transcriptions of the London-Lund corpus
produced as part of the Beach/CoMoPro projects at the University of
Edinburgh, Centre for Cognitive Science.
The Longman-Lancaster
Corpus: Approximately 14.5 million words
of
written
English from various geographical locations in the English-speaking
world
and of various dates and text types.
Survey of
English
Usage
(SEU) is an English Language research unit, based in the Department of
English Language and Literature at University College London.
The
York-Helsinki
Parsed Corpus of Old English Poetry: The York Poetry Corpus
contains
71,490 words of Old English text; the samples from the longer texts are 4,000 to 17,000
words in length. The texts represent a range of dates of composition
and
authors. The size of the corpus is approximately 2.5 megabytes.
A USENET corpus (2005-2007)
This corpus is a collection of public USENET postings. This corpus was
collected between Oct 2005 and Jan 2007, and covers 47860 English
language, non-binary-file news groups.
CORGA (Reference Corpus of
Present-day Galician Language)
German
The
TIGER project is pleased to announce the release of the first
version of the TIGER Corpus. This treebank consists of approx. 700,000
tokens (40,000 sentences) of German newspaper text. It was
semi-automatically tagged with part-of-speech and syntactic structures.
NEGRA
Corpus
: A Syntactically Annotated Corpus of German Newspaper Texts. The
NEGRA corpus consists of approx. 20,602 sentences of German newspaper
text
taken from the Frankfurter Rundschau.
The COSMAS
corpus archive
: more than 1736 million running words, of which over one billion words
can be searched by the general public for free for research purposes.
The Bonn
corpus of
Early New High German : The corpus was constructed between 1972 and
1985 at the University of Bonn. It contains 40 texts of approx. 30 pages
each.
CORIS/CODIS
a
100-million-word
corpus of contemporary written Italian
BADIP
(Banca dati
dell'italiano parlato) containing an online edition of the 500,000 word
LIP-Corpus.
The edition is enriched with POS-tags and lemmata.
Japanese
SAMANTHA
ERROR CORPUS is a compilation of spelling errors produced by 333
Japanese users of the English language. The subjects range in age from 14
to 59, and their occupations vary from junior-high school student to
university teacher.
Malay
Malay
Concordance Project a corpus of classical Malay texts (now
nearly 4 million words, including over 50,000 verses) which can be
searched on-line.
Portuguese
NLP
resources
for Portuguese (listing corpora, dictionaries, terminological
databases,
tools and other possible pointers of interest)
ATIS:
an old version of the ATIS corpus which has been compositionally
enriched
with logical-semantic representations.
OVIS treebank:
each
syntactic
node is enriched with a compositional semantic representation. The OVIS
treebank is used in the Dutch Priority programme Language and Speech
Technology.
European
Parliament
Proceedings 1996-2001 : This parallel corpus is extracted from the
proceedings of the European Parliament. It includes versions in 11
European
languages: Romance (French, Italian, Spanish, Portuguese), Germanic
(English,
Dutch, German, Danish, Swedish), Greek and Finnish.
The JRC-Acquis Multilingual Parallel Corpus : The JRC-Acquis covers the 20 official EU languages plus Romanian. Norwegian
is thus not included, but several other Scandinavian languages are. The
corpus is paragraph-aligned for each of the 190 language pairs.
OPUS
is an attempt to collect translated texts from the web, to convert
and align the entire collection, to add linguistic data, and to provide
the community with a publicly available parallel corpus.
Carmel Project
travel stories of the 19th and early 20th century (Darwin, Loti,
Stendhal, Flaubert, Dickens, London, etc.), translated into English,
French, Italian and Spanish.
Distributors
Linguistic Data Consortium
The
Linguistic Data Consortium is an open consortium of universities,
companies
and government research laboratories. It creates, collects and
distributes
speech and text databases, lexicons, and other resources for research
and
development purposes. The University of Pennsylvania is the LDC's host
institution.
ICAME Corpora:
(International
Computer Archive of Modern and Medieval English)
EUROPEAN
LANGUAGE
RESOURCES
ASSOCIATION The overall goal of ELRA is to provide a centralized
organization
for the validation, management, and distribution of speech, text, and
terminology
resources and tools, and to promote their use within the European
telematics
R&TD community.
Text
Archive - Text Center
Alex: A
Catalogue of
Electronic Texts on the Internet : Alex allows users to find
and retrieve the full-text of documents on the Internet. It
currently
indexes over 700 books and shorter texts by author and title,
incorporating
texts from Project Gutenberg, Wiretap, the On-line Book Initiative, the
Eris system at Virginia Tech, the English Server at Carnegie Mellon
University,
and the on-line portion of the Oxford Text Archive. For now it
includes
no serials. Alex does include an entry for itself.
American
Memory is
the online resource compiled by the Library of Congress National
Digital
Library Program. With the participation of other libraries and
archives,
the program provides a gateway to rich primary source materials
relating
to the history and cultural developments of the United States. Over one
million items from our historical collections are currently available
online.
Berkeley Digital Library
The
Berkeley
Digital Library SunSITE builds digital collections and services while
providing
information and support to others doing the same. We are sponsored by
The
Library, UC Berkeley and Sun Microsystems, Inc.
CCAT (Center
for
Computer
Analysis of Texts) at the University of Pennsylvania has one of the
biggest
archives of ready-to-use e-texts. [downloadable]
Center for Electronic Texts
serves all U.S. scholars, researchers and teachers involved with the
creation
and use of electronic text applications in the humanities.
Center for Electronic Text
in
the
Law: CETL currently produces two text databases that can be
accessed
from the Internet. The first is the University of Cincinnati's portion
of DIANA, a unique database of human rights materials. The second
database,
the Securities Lawyer's Deskbook, provides electronic access from the
Internet to the text of the Securities Act of 1933 and the Securities
Exchange
Act of 1934, together with the rules and forms necessary for compliance
with these statutes.
Christian Classics Ethereal
Library
Classic Christian books in electronic format, selected for your
edification.
There is enough good reading material here to last you a lifetime, if
you
give each work the time it deserves! All of the books on this server
are
believed to be in the public domain in the United States unless
otherwise
specified.
The English Server: The
English
Server is a cooperative which has been publishing humanities texts
online
since 1990. Today it offers over eighteen thousand works, covering a
wide
range of interests. [downloadable]
The Internet Classics Archive:
an
award-winning, searchable collection of over 400 classical Greek and
Latin
texts (in English translation)
The Labyrinth:
The
Labyrinth
is a global information network providing free, organized access to
electronic
resources in medieval studies through a World Wide Web server at
Georgetown
University.
Online Book Initiative:
The
OBI
is a project to make a large collection of freely redistributable text
available in a common format for others to do with as they like.
[downloadable]
The Online Books
Page:
The On-Line Books Page is a directory of books that can be freely read
right on the Internet. [downloadable]
Oxford Text Archive: The
Archive
has
been collecting electronic texts for some twenty years from a wide
variety
of sources, and its holdings reflect the diversity of this medium.
[downloadable]
Project Gutenberg The Project
Gutenberg
Philosophy is to make information, books and other materials available
to the general public in forms a vast majority of the computers,
programs
and people can easily read, use, quote, and search. (FTP
site)
MonoConc
: Michael
Barlow's concordance program. It's very fast, easy to use, and can
process
Thai texts.
WCONCORD:
This link leads you to WinConcordancer, a concordancing programme for
Windows
developed at the TH Darmstadt Dept. of Linguistics and Literature. The
program
was developed by Zdenek Martinek from the University of West Bohemia,
Pilsen,
Czech
Republic, in close collaboration with Les Siegrist, from the Technische
Universität Darmstadt, Germany.
The CLAN
Program:
The CLAN program is available for five platforms.
Conc (Mac)
Non-commercial
interactive concordancer, developed by the Summer Institute of
Linguistics.
ConcApp :
Concapp Concordance
Browser and Editor for Windows 95 / 98 / NT
Concordance
of Asian Newspaper English : This concordance has been produced
from
a corpus of newspaper reports published in 18 Asian countries
between
September and November 2000. Samples of around 6,000 words were
collected
from the Internet versions of newspapers.
KWiCFinder
: another web concordance, requires Windows 95/98/ME & Internet
Explorer
5.0 or greater
Spaceless.com's
Concordancer :
Concordancer takes the text of a web page and creates a list
of sentences that contain the search term.
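The idea behind a sentence-level concordancer of this kind can be sketched in a few lines of Python. This is only an illustration, not the tool's own implementation: the function name `sentences_containing`, the naive regular-expression sentence splitter, and the sample text are all assumptions made for the example.

```python
import re

def sentences_containing(text, term):
    """Return the sentences of `text` that contain `term` as a whole word."""
    # Naive splitter (an assumption): a sentence ends at '.', '!' or '?'
    # followed by whitespace. Abbreviations will be split incorrectly.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Match the term case-insensitively and only at word boundaries.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

sample = "A corpus is a body of text. Linguists search it. The corpus grows!"
for hit in sentences_containing(sample, "corpus"):
    print(hit)
```

A real tool would first fetch the page and strip its markup; that step is left out here to keep the sketch self-contained.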
TACT
is a
text-analysis and retrieval system for MS-DOS that permits inquiries on
text databases in European languages.
WebCONC
is a tool for generating KWIC-concordances based on webpages. There are
two options for defining your corpus: let Google search the relevant
webpages
for you or define the URLs that will be used yourself.
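A KWIC (keyword-in-context) concordance shows each occurrence of a search word with a fixed amount of context on either side. The following Python sketch shows the general technique on an already tokenized text; it does not reproduce how WebCONC works, and the function name `kwic`, the `width` parameter, and the sample sentence are hypothetical choices for illustration.

```python
def kwic(tokens, keyword, width=3):
    """Return (left, keyword, right) context triples for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            # Up to `width` tokens before and after the hit.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

text = "the cat sat on the mat and the dog sat on the rug"
for left, kw, right in kwic(text.split(), "sat", width=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>12}  {kw}  {right}")
```

Aligning the keyword in a single column is what makes recurring patterns to its left and right easy to scan by eye.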
WebCorp : an
Internet
search
tool which allows on-line access to Web texts as linguistic rather than
information sources.
WordSmith
An integrated suite of programs for looking at how words behave in
texts.
TnT --
Statistical Part-of-Speech
Tagging: is a very efficient statistical part-of-speech
tagger
that is trainable on different languages and virtually any tagset. The
component for parameter generation trains on tagged corpora. The system
incorporates several methods of smoothing and of handling unknown words.
TOSCA/LOB
tagger
This tagger takes as input English text, possibly containing SGML
markup,
and produces tagged text, both in a multi-column and in an SGML/TEI
format.
The tagset used is basically the LOB tagset (about 130 tags plus
ditto-tags),
although with a few very slight adjustments.
EngCG-2
tagger
: EngCG-2 is a program that assigns morphological and
part-of-speech
tags to words in English text.
AUTASYS
- A Fully Automatic English Wordclass Analysis System : AUTASYS is
a menu-driven automatic tagging and lemmatising system that analyses
English
texts at word-class level with the Lancaster-Oslo-Bergen (LOB) tagset,
the International Corpus of English (ICE) tagset, and the “skeleton”
tagset
(SKELETON), which is the set of base tags from ICE without features.
Morphy
:
German
morphology and part-of-speech tagging (Win 95/NT)
Parsers
Conexor's analysers for English: Conexor sells fast and accurate programs
for tagging and parsing English texts: Conexor Constraint Grammar of
English (EngCG-2 tagger), Conexor Light Syntax of English (EngLite parser),
and Conexor Functional Dependency Grammar of English (FDG parser).
The Link Grammar
Parser
is a syntactic parser of English, based on link grammar, an original
theory
of English syntax. Given a sentence, the system assigns to it a
syntactic
structure, which consists of a set of labeled links connecting pairs of
words.
Apple Pie
Parser
The parser is a bottom-up probabilistic chart parser which finds the parse
tree with the best score using a best-first search algorithm. Its grammar
(of English) in the distribution is a semi-context-sensitive grammar with
two non-terminals, automatically extracted from the Penn Treebank, a
syntactically tagged corpus made at the University of Pennsylvania.
XAIRA : XML Aware Indexing
and Retrieval Architecture
PIE : PIE incorporates a database
derived from the second or World Edition of the BNC (2000), but is not
affiliated with the BNC Consortium. It aims to provide a simple yet
powerful interface for studying words and phrases up to six words long.
Unitex
: Unitex is a
corpus processing system, based on automata-oriented technology.
The concept of this software was born at LADL (Laboratoire
d'Automatique Documentaire et Linguistique), under the direction of
its director, Maurice Gross. With this tool, you can handle
electronic resources such as electronic dictionaries and grammars
and apply them. You can work at the levels of morphology, the
lexicon and syntax.
Multilingual
Text
Tools and Corpora Multext encompasses a series of projects whose
goals
are to develop standards and specifications for the encoding and
processing
of linguistic corpora, and to develop tools, corpora and linguistic
resources
embodying these standards.
The
Alembic Workbench project has as its goal the creation of a
natural
language engineering environment for the development of tagged corpora.
Qwick
is a
corpus
browser that allows you to build up your own working corpus, retrieve
concordance lines using a simple but powerful query language, and to
compute collocation statistics using a variety of adjustable
parameters.
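The simplest collocation statistic is a raw count of the words that occur within a fixed span of a node word. The Python sketch below illustrates that idea only; it says nothing about which measures Qwick actually computes, and the function name `collocates`, the `span` parameter, and the toy sentence are assumptions made for the example.

```python
from collections import Counter

def collocates(tokens, node, span=2):
    """Count words occurring within `span` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Window of up to `span` tokens on each side, excluding the node itself.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

tokens = "strong tea and strong coffee but weak tea".split()
print(collocates(tokens, "tea", span=1).most_common())
```

Raw counts favour frequent words, which is why corpus tools usually also offer association measures that compare observed with expected co-occurrence; the span width is one of the parameters a user would typically adjust.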
The
IMS
Corpus
Toolbox Webpage: a set of tools for administering, indexing, and
querying large text corpora (for SunOS 4.1.x, SunOS 5.4 and higher
(aka Solaris 2.4 and higher), and Linux).
LTG Software:
The
Language
Technology Group makes available various software packages. For
research
purposes, these are often available for free to academic research
groups
and for a small fee to industrial R&D groups.
GATE
is a software environment that supports researchers in Natural Language
Processing (NLP) and Computational Linguistics (CL) and developers who
are producing and delivering Language Engineering (LE) systems.
The
Pizza
Chef:
a TEI Tag Set Selector: These pages will help you design your own
TEI-conformant
document type definition. You will be able to select the TEI tag sets
you
need to make up your very own view of the TEI dtd, including your own
modifications
and restrictions.
SGML
Query
Language:
SgmlQL is a programming language for the manipulation of SGML/HTML
documents.
XED: An XML
document instance
editor: XED is a text editor for XML document instances. It is
designed
to support hand-authoring of small-to-medium size XML documents, and is
optimised for keyboard input. It works very hard to ensure that you
cannot
produce a non-well-formed document.
Xlex/www is a
web-based
interface
to a suite of Unix/Linux command line tools (tokenizer, index, POS
tagger,
concordancer, statistical analysis, etc.)
Language
Technology
Group
(LTG) The LTG is a technology transfer group working in the area of
natural language engineering. Based in Edinburgh's Human Communication
Research Centre, it can draw on the skills and expertise of one of the
largest communities of natural language processing specialists in
Europe.
Text Encoding
Initiative
: is an international project to develop guidelines for the
preparation
and interchange of electronic texts for scholarly research, and to
satisfy
a broad range of uses by the language industries more generally.
Library of
Congress EAD
(Encoded Archival Description) The Library of Congress has been active
in developing and testing markup of archival finding aids using Encoded
Archival Description, a document type definition (DTD) of Standard
Generalized
Markup Language (SGML).
ARCADE
is part of the Concerted Research Action (ARC) of the ILEC group (Ingénierie de
la Langue--Linguistique--Informatique et Corpus écrits). It started
in 1995 in order to promote research in the field of multilingual
alignment.
The first two-year period (1995-96) was dedicated to two main tasks:
producing a large, rich and standardised bilingual
(French-English) corpus suited to the alignment task, and devising
and phasing in a protocol for objectively evaluating alignment
systems.
European
Language
Resources
Association ELRA was established in Luxembourg in February, 1995,
with
the goal of founding an organization to promote the creation,
verification,
and distribution of language resources in Europe.
ECI: European
Corpus Initiative
The aim was to produce a reasonably large text corpus of the major
European
languages for the linguistic research community. The majority of the
work
was done at the HCRC in Edinburgh and at ISSCO, University of Geneva.
The
ECI/MCI corpus has now been published on CD-ROM, and contains almost
100
million words in 27 (mainly European) languages. It consists of 48
component
corpora marked up in SGML, with easy access to the source text without
markup. 12 of the component corpora are multilingual parallel corpora
with two to nine sub-corpora each.
Trans-European
Language
Resources
(TELRI) is a European Commission-funded initiative which is
creating
a viable infrastructure between leading European language and language
technology centres in order to provide a platform for industry,
research
institutes and universities and to supply the NLP community with
precompetitive
/ public domain monolingual and multilingual language resources.
TRACTOR: TELRI Research
Archive of
Computational Tools and Resources
Interval
A European project (LE2-4002/10380) on the validation of terminology
resources,
co-financed by the EC (DG XIII) within the framework of the Language
Engineering
programme, which addresses the following issues: validation of
multilingual terminology resources, validation methodologies and software
toolkit, World Wide Web dissemination, and CD-ROM data banks.
The
SGML/XML Web
Page is a comprehensive online database containing reference
information
and software pertaining to the Standard Generalized Markup Language
(SGML)
and its subset, the Extensible Markup Language (XML).
English
Language
Corpora
Homepage This page lists centres and projects from which language
corpora
(chiefly English language) are readily available.
Library
Electronic Text
Resource Service (LETRS) serves as a focal point for members of the
Indiana U community interested in identifying, acquiring, and using
electronic
resources for humanities research and teaching.
Guide
to Digital Resources 1996-1998 : This is the fourth edition of the
CTI
Textual Studies Guide to Digital Resources. The Guide aims to give an
overview
of digital resources which may have application for Higher Education
teaching
and research in the disciplines supported by the Centre: Literary
Studies
in all languages and periods, Literary Linguistics, Philosophy,
Theology
& Religious Studies, Classics, Film and Media Studies, and Drama.
UCREL
Technical
Papers. The papers fall into two categories: (1) articles dealing with
corpora
and computational linguistics and (2) corpus manuals.
Technology in Language
Learning
: A
site for teachers interested in the use of computers and
telecommunications
for English language education.