Convert natural language text into tokens. The tokenizers have a consistent interface and are compatible with Unicode, thanks to being built on the 'stringi' package. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, lines, and regular expressions.
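The tokenizers share one shape of interface: each tokenize_* function takes a character vector and returns a list with one character vector of tokens per input element. A brief sketch of a few of them (function names as listed in the package manual; the sample text is illustrative):

library(tokenizers)

text <- "The quick brown fox jumps over the lazy dog. It barked."

# Sentences
tokenize_sentences(text)

# Word stems (stemming provided via SnowballC)
tokenize_word_stems(text)

# Shingled n-grams (here, word bigrams)
tokenize_ngrams(text, n = 2)

# Skip n-grams: n-grams that may skip up to k words between tokens
tokenize_skip_ngrams(text, n = 2, k = 1)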

Documentation

Manual: tokenizers.pdf
Vignette: Introduction to the tokenizers Package

Maintainer: Lincoln Mullen <lincoln at lincolnmullen.com>

Author(s): Lincoln Mullen, Dmitriy Selivanov

Install the package and any missing dependencies by running this line in your R console:

install.packages("tokenizers")
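
After installing, loading the package and running one tokenizer makes a quick smoke test. The commented lines show, approximately, the value tokenize_words() returns with its default lowercasing and punctuation stripping:

library(tokenizers)
tokenize_words("A sample sentence, with punctuation!")
# [[1]]
# [1] "a" "sample" "sentence" "with" "punctuation"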

Depends: R (>= 3.1.3)
Imports: stringi (>= 1.0.1), Rcpp (>= 0.12.3), SnowballC (>= 0.5.1)
Suggests: testthat, covr, knitr, rmarkdown
LinkingTo: Rcpp
Reverse imports: ptstem, tidytext
Reverse suggests: cleanNLP

Package: tokenizers
URL: https://github.com/ropensci/tokenizers
Task Views: NaturalLanguageProcessing
Version: 0.1.4
Published: 2016-08-29
License: MIT + file LICENSE
BugReports: https://github.com/ropensci/tokenizers/issues
NeedsCompilation: yes
CRAN checks: tokenizers check results
Package source: tokenizers_0.1.4.tar.gz