| Type: | Package | 
| Title: | Load WARC Files into Apache Spark | 
| Version: | 0.1.6 | 
| Maintainer: | Edgar Ruiz <edgar@rstudio.com> | 
| Description: | Load WARC (Web ARChive) files into Apache Spark using 'sparklyr'. This allows reading files from the Common Crawl project <http://commoncrawl.org/>. | 
| License: | Apache License 2.0 | 
| BugReports: | https://github.com/r-spark/sparkwarc | 
| Encoding: | UTF-8 | 
| Imports: | DBI, sparklyr, Rcpp | 
| RoxygenNote: | 7.1.1 | 
| LinkingTo: | Rcpp | 
| SystemRequirements: | C++11 | 
| NeedsCompilation: | yes | 
| Packaged: | 2022-01-10 16:40:06 UTC; yitaoli | 
| Author: | Javier Luraschi [aut], Yitao Li | 
| Repository: | CRAN | 
| Date/Publication: | 2022-01-11 08:50:02 UTC | 
Provides WARC paths for commoncrawl.org
Description
Provides WARC paths for commoncrawl.org. To be used with
spark_read_warc.
Usage
cc_warc(start, end = start)
Arguments
start | The first path to retrieve. | 
end | The last path to retrieve. | 
Examples
cc_warc(1)
cc_warc(2, 3)
Loads the sample WARC file using Rcpp
Description
Loads the sample WARC file using Rcpp.
Usage
rcpp_read_warc_sample(filter = "", include = "")
Arguments
filter | A regular expression used to filter each WARC entry efficiently by running native code. | 
include | A regular expression used to keep only matching lines efficiently by running native code. | 
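For example, the bundled sample WARC can be read locally without a Spark connection; the regular expressions shown here are illustrative, not required values:

```r
library(sparkwarc)

# Read the sample WARC file shipped with the package, keeping only
# response records and lines that look like HTML titles.
entries <- rcpp_read_warc_sample(
  filter = "WARC-Type: response",
  include = "<title"
)
```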
Reads a WARC File using Rcpp
Description
Reads a WARC (Web ARChive) file using Rcpp.
Usage
spark_rcpp_read_warc(path, match_warc, match_line)
Arguments
path | The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3n://" and "file://" protocols. | 
match_warc | Include only WARC entries matching this character string. | 
match_line | Include only lines matching this character string. | 
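A minimal sketch using the sample WARC path helper from this package; the match patterns are illustrative:

```r
library(sparkwarc)

# Parse a local WARC file with the Rcpp reader. match_warc and
# match_line restrict the result to matching entries and lines.
warc_path <- spark_warc_sample_path()
df <- spark_rcpp_read_warc(
  path = warc_path,
  match_warc = "WARC-Type: response",
  match_line = "<html"
)
```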
Reads a WARC File into Apache Spark
Description
Reads a WARC (Web ARChive) file into Apache Spark using sparklyr.
Usage
spark_read_warc(
  sc,
  name,
  path,
  repartition = 0L,
  memory = TRUE,
  overwrite = TRUE,
  match_warc = "",
  match_line = "",
  parser = c("r", "scala"),
  ...
)
Arguments
sc | An active spark_connection. | 
name | The name to assign to the newly generated table. | 
path | The path to the file. Needs to be accessible from the cluster. Supports the "hdfs://", "s3n://" and "file://" protocols. | 
repartition | The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. | 
memory | Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) | 
overwrite | Boolean; overwrite the table with the given name if it already exists? | 
match_warc | Include only WARC entries matching this character string. | 
match_line | Include only lines matching this character string. | 
parser | Which parser implementation to use? Options are "scala" or "r" (default). | 
... | Additional arguments reserved for future use. | 
Examples
## Not run: 
library(sparklyr)
library(sparkwarc)
sc <- spark_connect(master = "local")
sdf <- spark_read_warc(
  sc,
  name = "sample_warc",
  path = system.file(file.path("samples", "sample.warc"), package = "sparkwarc"),
  memory = FALSE,
  overwrite = FALSE
)
spark_disconnect(sc)
## End(Not run)
Loads the sample WARC file into Spark
Description
Loads the sample WARC file into Spark.
Usage
spark_read_warc_sample(sc, filter = "", include = "")
Arguments
sc | An active spark_connection. | 
filter | A regular expression used to filter each WARC entry efficiently by running native code. | 
include | A regular expression used to keep only matching lines efficiently by running native code. | 
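A sketch of typical use, assuming a local Spark installation is available; the filter pattern is illustrative:

```r
library(sparklyr)
library(sparkwarc)

# Connect to a local Spark instance and load the bundled sample,
# keeping only response records.
sc <- spark_connect(master = "local")
sample_tbl <- spark_read_warc_sample(sc, filter = "WARC-Type: response")
spark_disconnect(sc)
```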
Retrieves the sample WARC path
Description
Retrieves the path to the sample WARC file bundled with the package.
Usage
spark_warc_sample_path()
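The helper takes no arguments and simply returns a local file path, which can then be passed to the readers in this package:

```r
library(sparkwarc)

# Locate the sample WARC file installed with the package
path <- spark_warc_sample_path()
file.exists(path)
```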
sparkwarc
Description
A sparklyr extension for loading WARC (Web ARChive) files into Apache Spark.