Count: 2,585,000,000 Words From 67 Languages


HC Corpora is a collection of corpora for various languages, freely available to download.
The corpora have been collected from numerous different webpages, with the aim of getting a varied and comprehensive corpus of current use of the respective language.
I have tried to draw on many different types of sources, such as newspapers, magazines, personal and professional blogs, and Twitter updates.
I also have corpora with misspelled words marked. You can find them on SourceForge.


05/05/2016 - Language Update - Kannada
12/03/2016 - Language Update - Armenian
08/03/2016 - Language Update - Norwegian
29/02/2016 - Language Update - Portuguese (Brazil)
23/02/2016 - Language Update - Estonian
11/02/2013 - Language Update - Czech

For all the latest news, follow hc_corpora on Twitter or read the blog.

Current upload schedule:

In Progress:

Planned future corpora:

I'm planning to collect corpora for many more languages, but the ones above are the ones I have specifically planned for.
If you are interested in any other language, you can contact me via e-mail (see below) and I will look into the possibility of assigning that language a higher priority.

In addition to the large corpora, I am also collecting corpora for a number of small languages. These are languages that do not have many native speakers (small countries or minority languages) or do not have much material available online (e.g. some languages of developing countries), so their corpora will not match the size and variety of those for the larger languages. These will be uploaded as they become available.

Specialised corpora and data mining
I have extensive experience collecting data of all sorts from the internet, ranging from Chinese word lists to business and medical corpora. If you have any particular needs, please contact me via e-mail. If it's on the net, I can get it for you!

Contact Information: Email Hans Christensen at: hc[dot]corpus [at] gmail (dot) com