Given a text or vector/list of texts, break the texts into smaller segments each with the same number of words. This allows you to treat a very long document, such as a novel, as a set of smaller documents.

chunk_text(x, chunk_size = 100, doc_id = names(x), ...)

Arguments

x

A character vector or a list of character vectors to be split into chunks. If x is a character vector, it can be of any length, and each element will be chunked separately. If x is a list of character vectors, each element of the list should have a length of 1.

chunk_size

The number of words in each chunk.

doc_id

The document IDs as a character vector. This will be taken from the names of the x vector if available. NULL is acceptable.

...

Arguments passed on to tokenize_words.

Details

Chunking the text passes it through tokenize_words, which will strip punctuation and lowercase the text unless you provide arguments to pass along to that function.
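The chunking step described above can be approximated in base R. This is only a minimal sketch, not the package's implementation. The helper name `chunk_words_sketch` is hypothetical. The lowercasing and punctuation stripping are a crude stand-in for tokenize_words, and the fixed four-digit suffix is an assumption modeled on the names in the example output.

```r
# Hypothetical helper, not part of the package: a base-R approximation of
# word chunking for a single string.
chunk_words_sketch <- function(text, chunk_size = 100, doc_id = "doc") {
  # Rough stand-in for tokenize_words: lowercase, drop punctuation,
  # then split on runs of whitespace.
  cleaned <- gsub("[[:punct:]]+", "", tolower(text))
  words <- strsplit(trimws(cleaned), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]

  # Give each word a chunk index: 1, 1, ..., 2, 2, ... and so on.
  groups <- ceiling(seq_along(words) / chunk_size)

  # Collapse the words of each chunk back into one string.
  chunks <- lapply(split(words, groups), paste, collapse = " ")

  # Assumed naming scheme: document ID plus a zero-padded chunk number.
  names(chunks) <- sprintf("%s-%04d", doc_id, seq_along(chunks))
  chunks
}

sample_text <- "Call me Ishmael. Some years ago, never mind how long precisely."
chunk_words_sketch(sample_text, chunk_size = 4, doc_id = "sample")
```

In this sketch the final chunk holds whatever words remain, so it can be shorter than chunk_size. The real tokenizer handles punctuation more carefully than the single `gsub` call here; for instance, the example output keeps "www.gutenberg.org" and "transcriber's" intact.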

Examples

chunked <- chunk_text(mobydick, chunk_size = 100)
length(chunked)
#> [1] 2195
chunked[1:3]
#> $`mobydick-0001`
#> [1] "the project gutenberg ebook of moby dick or the whale by herman melville this ebook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever you may copy it give it away or re use it under the terms of the project gutenberg license included with this ebook or online at www.gutenberg.org title moby dick or the whale author herman melville last updated january 3 2009 posting date december 25 2008 ebook 2701 release date june 2001 language english start of this project gutenberg ebook moby dick or the whale produced by daniel lazarus"
#>
#> $`mobydick-0002`
#> [1] "and jonesey moby dick or the whale by herman melville original transcriber's notes this text is a combination of etexts one from the now defunct eris project at virginia tech and one from project gutenberg's archives the proofreaders of this version are indebted to the university of adelaide library for preserving the virginia tech version the resulting etext was compared with a public domain hard copy version of the text in chapters 24 89 and 90 we substituted a capital l for the symbol for the british pound a unit of currency etymology supplied by a late consumptive usher to"
#>
#> $`mobydick-0003`
#> [1] "a grammar school the pale usher threadbare in coat heart body and brain i see him now he was ever dusting his old lexicons and grammars with a queer handkerchief mockingly embellished with all the gay flags of all the known nations of the world he loved to dust his old grammars it somehow mildly reminded him of his mortality while you take in hand to school others and to teach them by what name a whale fish is to be called in our tongue leaving out through ignorance the letter h which almost alone maketh the signification of the"
#>