Version 0.2.0 of the tokenizers package for R is now available on CRAN. The tokenizers package provides a fast, consistent set of functions to turn natural language text into tokens. This is a fairly substantial release, the result of collaboration with Dmitry Selivanov, Os Keyes, Ken Benoit, and Jeffrey Arnold. This version adds the following features (see the changelog for more details and attribution):
- A new tokenizer for tweets that preserves usernames, hashtags, and URLs.
- A new tokenizer for Penn Treebank style tokenization.
- A new function to split long documents into pieces of equal length.
- New functions to count words, characters, and sentences without tokenization.
- The package now uses C++98 rather than C++11, so more users will be able to install it without upgrading their compiler. (No more e-mails from CentOS 6 users.)
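As a rough illustration of some of the additions listed above, here is a minimal sketch. The sample strings and the chunk size of 100 are made up for the example, and it assumes the functions `tokenize_ptb()`, `chunk_text()`, `count_words()`, `count_characters()`, and `count_sentences()` as exported by the package; check the documentation of your installed version for the exact arguments.

```r
library(tokenizers)

# Made-up sample text for illustration
text <- "The tokenizers don't require much setup. They just work."

# Penn Treebank style tokenization; returns a list with one element per document
ptb_tokens <- tokenize_ptb(text)

# Counts computed directly, without building the token vectors first
n_words      <- count_words(text)
n_characters <- count_characters(text)
n_sentences  <- count_sentences(text)

# Split a long document into chunks of at most 100 words each
long_doc <- paste(rep("word", 250), collapse = " ")
chunks   <- chunk_text(long_doc, chunk_size = 100)
length(chunks)
```

Each chunk is a document in its own right, so the result can be passed straight to any of the tokenizers.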
Most importantly, the package implements the draft recommendations of the Text Interchange Format. The TIF standards were drafted at the 2017 rOpenSci Text Workshop. They define standards for tokens, corpora, and document-term matrices so that R text analysis packages can interoperate with one another. I think these standards, once finalized and widely adopted, will make the ecosystem of R text analysis packages much more coherent. A new vignette explains how the tokenizers package fits into this ecosystem.
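To give a sense of what this looks like in practice, here is a small sketch under my reading of the draft: a corpus can be a named character vector, and tokens can be either a named list or a data frame with `doc_id` and `token` columns. The sample corpus and the helper `tokens_to_df()` are hypothetical and not part of tokenizers; consult the TIF draft and the vignette for the authoritative definitions.

```r
library(tokenizers)

# A corpus as a named character vector; the names act as document IDs
corpus <- c(doc_a = "The quick brown fox.",
            doc_b = "Jumps over the lazy dog.")

# Tokens as a named list: the document IDs carry over as list names
tokens <- tokenize_words(corpus)
names(tokens)

# Hypothetical helper (not part of tokenizers): flatten the named list
# into a data frame with one row per token
tokens_to_df <- function(tokens) {
  data.frame(
    doc_id = rep(names(tokens), lengths(tokens)),
    token  = unlist(tokens, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

tokens_df <- tokens_to_df(tokens)
head(tokens_df)
```

Keeping the document IDs attached to the tokens is what lets another package pick up the output without guessing which tokens belong to which document.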