Count: 2,462,000,000 Words From 67 Languages
About The Corpora
HC Corpora is a collection of corpora for various languages,
freely available to download.
The corpora have been collected from numerous different
webpages, with the aim of getting a varied and comprehensive corpus
of current use of the respective language.
I have tried to draw on many different types of sources, such
as newspapers, magazines, (personal and professional) blogs and
Additionally, I have corpora with misspelled words marked. You can find them on SourceForge.
NOTE: I'm having some issues with the file host, so until that is resolved please use the Mediafire download links.
17/11/2013 - New Language - Bosnian
22/06/2013 - New Language - Serbian (Latin)
09/06/2013 - New Language - Nepali
08/06/2013 - New Language - Amharic
For all the latest news, follow hc_corpora on Twitter or read the blog.
Current upload schedule:
Laotian: January 2014
Planned future corpora:
I'm planning to collect corpora for many more languages, but the ones
above are the ones I have specifically planned for.
If you are interested in any other languages you can contact
me via e-mail (see below) and I will look into the possibility of
assigning the language a higher priority.
In addition to the large corpora I am also collecting corpora for a number of
small languages. These are languages that do not have a lot of native speakers
(small countries or minority languages) or do not have a lot of online availability
(e.g. some 3rd world languages) and thus the corpora will not be of the
same size and variety as those for the larger languages. These will be uploaded as they
Specialised corpora and data mining
I have extensive experience collecting data of all sorts from the internet, ranging from Chinese word lists to business and medical corpora. If you have any particular needs, please contact me via e-mail. If it's on the net, I can get it for you!
Contact Information: Email Hans Christensen at: hc[dot]corpus [at] gmail (dot) com