Count: 2,462,000,000 Words From 67 Languages

HC corpora is a collection of corpora for various languages freely available to download.
The corpora have been collected from numerous different webpages, with the aim of getting a varied and comprehensive corpus of current use of the respective language.
I have tried to draw on many different types of sources, such as newspapers, magazines, (personal and professional) blogs and Twitter updates.
I also have corpora with misspelled words marked. You can find them on SourceForge.

NOTE: I'm having some issues with the file host, so until that is resolved please use the Mediafire download links.

News:

17/11/2013 - New Language - Bosnian
22/06/2013 - New Language - Serbian (Latin)
09/06/2013 - New Language - Nepali
08/06/2013 - New Language - Amharic

For all the latest news, follow hc_corpora on Twitter or read the blog.

Current upload schedule:

In Progress:
Laotian: January 2014

Planned future corpora:

I'm planning to collect corpora for many more languages, but the ones above are the ones I have specifically planned for.
If you are interested in any other languages, you can contact me via email (see below) and I will look into the possibility of assigning the language a higher priority.

In addition to the large corpora, I am also collecting corpora for a number of small languages. These are languages that have few native speakers (small countries or minority languages) or little online presence (e.g. some languages of developing countries), so these corpora will not match those of the larger languages in size or variety. They will be uploaded as they become available.

Specialised corpora and data mining
I have extensive experience collecting data of all sorts from the internet, ranging from Chinese word lists to business and medical corpora. If you have any particular needs, please contact me via email. If it's on the net, I can get it for you!

Contact Information: Email Hans Christensen at: hc[dot]corpus [at] gmail (dot) com