New release: tokenizers v0.2.0

A new v0.2.0 release of the tokenizers package for R is now available on CRAN. The tokenizers package provides a fast, consistent set of functions to turn natural language text into tokens. This is a fairly substantial release, the result of collaboration with Dmitry Selivanov, Os Keyes, Ken Benoit, and Jeffrey Arnold. This version adds the following features (see the changelog for more details and attribution):

Most importantly, the package implements the draft recommendations of the Text Interchange Format (TIF). The TIF standards were drafted at the 2017 rOpenSci Text Workshop. They define standards for tokens, corpora, and document-term matrices so that R text analysis packages can interoperate with one another. I think these standards, once finalized and widely adopted, will be a very positive development for the coherence of R's text analysis ecosystem. A new vignette explains how the tokenizers package fits into that ecosystem.
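To make the interoperability point concrete, here is a minimal sketch of the kind of round trip TIF enables. It assumes the CRAN `tokenizers` API (`tokenize_words()` accepting a named character vector and returning a named list of tokens, one element per document); the document names and text are illustrative, not from the package.

```r
library(tokenizers)

# A corpus as a named character vector: names are document IDs,
# values are the document text (one TIF-style corpus representation)
corpus <- c(
  doc1 = "The quick brown fox.",
  doc2 = "Jumped over the lazy dog."
)

# tokenize_words() returns a named list of character vectors,
# one element per input document -- tokens in the list form that
# other TIF-aware packages can consume directly
tokens <- tokenize_words(corpus)

str(tokens)
```

Because the output keeps the document IDs and uses a plain list-of-character-vectors structure, it can be handed to other packages that follow the same conventions without ad hoc conversion code.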

Finally, the package now has a new website for vignettes and documentation, thanks to pkgdown. What the package does not have is a nice hex-sticker logo, but perhaps that can come in due time.
