Welcome to the kibior package introduction vignette!
Making datasets findable, accessible, interoperable and reusable (the FAIR principles) is one of the hot topics in science: it brings openness and versioning, and unlocks reproducibility. To support that, great projects such as the biomaRt R package enable fast consumption and easy handling of massive validated data through a small R interface.
Even though major entities such as Ensembl or NCBI provide massive amounts of data, they do not provide a way to store data elsewhere, delegating data handling to research teams. During data analysis, this can be an issue since researchers often need to send intermediary subsets of analyzed data to collaborators. Moreover, it is now pretty common that, when a new database or dataset emerges, a web platform and an API are provided alongside it, allowing easier exploration and querying.
Multiplying the number of life-science research teams worldwide by the ever-growing number of databases and datasets published on widely varying subfields results in an even greater number of ways to query heterogeneous life-science data.
Here, we present an easy way to manipulate and share datasets through decentralization. Indeed, kibior seeks to make available a search engine and distributed database system for sharing data easily through the use of Elasticsearch (ES) and Elasticsearch-based architectures such as Kibio.
It is a way to handle large datasets and unlock the possibility to:
The following sections will explain some basic and advanced technical usage of kibior. A second vignette will apply these features to biological applications.
We will use both Elasticsearch and R vocabulary, which have similar notions:
R | Elasticsearch |
---|---|
data(set), tibble, df, etc. | index |
columns, variables | fields |
lines, observations | documents |
kibior uses tibbles as main data representation.
The public Kibio instance is available at kibio.compbio.ulaval.ca, port 80. You can simply connect to it via the get_kibio_instance() method of kibior.
Before moving on to the second, separate vignette, which shows examples on biological datasets, we strongly advise the reader to start with the basic and advanced usage sections. In these sections, we will use some datasets taken from other well-known packages, such as dplyr::starwars…
name <chr> | height <int> | mass <dbl> | hair_color <chr> | skin_color <chr> | eye_color <chr> | birth_year <dbl> | sex <chr> | gender <chr> |
---|---|---|---|---|---|---|---|---|
Luke Skywalker | 172 | 77 | blond | fair | blue | 19.0 | male | masculine |
C-3PO | 167 | 75 | NA | gold | yellow | 112.0 | none | masculine |
R2-D2 | 96 | 32 | NA | white, blue | red | 33.0 | none | masculine |
Darth Vader | 202 | 136 | none | white | yellow | 41.9 | male | masculine |
Leia Organa | 150 | 49 | brown | light | brown | 19.0 | female | feminine |
…dplyr::storms…
name <chr> | year <dbl> | month <dbl> | day <int> | hour <dbl> | lat <dbl> | long <dbl> | status <chr> | category <ord> | wind <int> |
---|---|---|---|---|---|---|---|---|---|
Amy | 1975 | 6 | 27 | 0 | 27.5 | -79.0 | tropical depression | -1 | 25 |
Amy | 1975 | 6 | 27 | 6 | 28.5 | -79.0 | tropical depression | -1 | 25 |
Amy | 1975 | 6 | 27 | 12 | 29.5 | -79.0 | tropical depression | -1 | 25 |
Amy | 1975 | 6 | 27 | 18 | 30.5 | -79.0 | tropical depression | -1 | 25 |
Amy | 1975 | 6 | 28 | 0 | 31.5 | -78.8 | tropical depression | -1 | 25 |
…datasets::iris…
Sepal.Length <dbl> | Sepal.Width <dbl> | Petal.Length <dbl> | Petal.Width <dbl> | Species <fct> |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
…and ggplot2::diamonds to show our examples.
carat <dbl> | cut <ord> | color <ord> | clarity <ord> | depth <dbl> | table <dbl> | price <int> | x <dbl> | y <dbl> | z <dbl> |
---|---|---|---|---|---|---|---|---|---|
0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
0.23 | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
0.29 | Premium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 |
0.31 | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 |
Before starting, you should know that this step will start an Elasticsearch service and store all data on your machine.
So, you should weigh the quantity of data you will handle in your code against the remaining space on your computer.
To use this feature, you will need Docker and docker-compose installed on your system.
To install Docker, simply follow the steps detailed on its website.
If you are on a Linux / Unix-based system, you should also check the post-installation steps, mainly the Manage Docker as a non-root user step.
To install docker-compose, simply follow the next steps.
We want something easy to use, so we use docker-compose with a configuration file. You can also use plain docker by passing all parameters inline, but it is verbose.
You can find the files described below in the kibior package, folder inst/docker_conf.
Copy-paste these lines in a new elasticsearch.yml file.
cluster.name: "docker-cluster"
network.host: 0.0.0.0
# minimum_master_nodes need to be explicitly set when bound on a public IP
# set to 1 to allow single node clusters
# Details: https://github.com/elastic/elasticsearch/pull/17288
discovery.zen.minimum_master_nodes: 1
# Uncomment and tweak the following lines if you need to connect to remote instances
# such as Kibio's or if you want to configure several disjoint local instances.
# This also allows to use KibioR `$copy()` and `$move()` methods with remote instances.
# reindex.remote.whitelist: [
# "first_instance:9200",
# "second_instance:9200",
# ]
Copy-paste these lines in a new resolv.conf file if you need to connect to ES named services on the web.
Copy-paste these lines inside a single-es.yml file.
version: '2.4'
services:
  ## --------------------------
  ## If you need rstudio
  ## --------------------------
  # rstudio4:
  #   container_name: rstudio4
  #   image: rocker/rstudio:4.0.3
  #   environment:
  #     - PASSWORD=myrstudio
  #     - USERID=1000
  #   #
  #   volumes:
  #     - type: bind
  #       source: <path_for_RStudio_data_folder_on_your_computer>
  #       target: /work/rstudio/data # we create a folder inside the container
  #       read_only: false
  #   #
  #   ports:
  #     - 8787:8787
  #   networks:
  #     - kibiornet
  #   # cpu and ram constraints
  #   cpu_count: 1
  #   cpu_percent: 75
  #   cpus: 0.75
  #   memswap_limit: 0
  #   mem_reservation: 256m
  #   mem_limit: 6g
  ## --------------------------
  ## If you need a bash cli + R cli
  ## See https://hub.docker.com/u/rocker for more versions
  ## with preinstalled material (e.g. tidyverse)
  ## --------------------------
  # r4:
  #   container_name: r4
  #   image: roncar/kibior-env:4.0.3 # pre-configured R version 4.0.3 with Kibior installed
  #   stdin_open: true
  #   tty: true
  #   entrypoint: "/bin/bash"
  #   #
  #   volumes:
  #     - type: bind
  #       source: <path_for_R_data_folder_on_your_computer>
  #       target: /work/r/data # we create a folder inside the container
  #       read_only: false
  #     - type: bind
  #       source: ./resolv.conf
  #       target: /etc/resolv.conf
  #       read_only: false
  #   #
  #   networks:
  #     - kibiornet
  #   # cpu and ram constraints
  #   cpu_count: 1
  #   cpu_percent: 75
  #   cpus: 0.75
  #   memswap_limit: 0
  #   mem_reservation: 256m
  #   mem_limit: 6g
  ## --------------------------
  ## Elasticsearch container
  ## --------------------------
  elasticsearch:
    # this configuration will run a service called "elasticsearch"
    container_name: elasticsearch
    # the elasticsearch image used will be version 7
    # but you can use another version, such as 6.8.6
    image: docker.elastic.co/elasticsearch/elasticsearch:7.10.2
    # defines env var
    # last line tells us java will use 512MB
    # if you need more, change it for 2GB, for instance
    # "ES_JAVA_OPTS=-Xms2g -Xmx2g"
    environment:
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    # strict limit to 1GB of RAM
    mem_limit: 1g
    memswap_limit: 0
    # lock memory
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    # bind files and folders of your system with those inside of the container
    volumes:
      # ES data folder
      - type: bind
        source: <path_for_es_data_folder_on_your_computer>
        target: /usr/share/elasticsearch/data
        read_only: false
      # ES configurations
      - type: bind
        source: ./elasticsearch.yml
        target: /usr/share/elasticsearch/config/elasticsearch.yml
        read_only: true
    # export port to access Elasticsearch service from outside docker
    ports:
      - 9200:9200
    # networks managed by docker
    networks:
      - kibiornet
# network declaration
networks:
  kibiornet:
Now, run the configuration to launch the service(s) with:
# run services (daemonized)
➜ docker-compose -f single-es.yml up -d
Starting elasticsearch ... done
# see the current docker processes
➜ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
40814036d980 docker.elastic.co/elasticsearch/elasticsearch:7.10.2 "/tini -- /usr/local…" 30 minutes ago Up 5 seconds 0.0.0.0:9200->9200/tcp, 9300/tcp elasticsearch
# curl
➜ curl -X GET localhost:9200
{
"name" : "40814036d980",
"cluster_name" : "docker-cluster",
"cluster_uuid" : "InZqVTNiTK6idAWrEweWDg",
"version" : {
"number" : "7.10.2",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
"build_date" : "2021-01-13T00:42:12.435326Z",
"build_snapshot" : false,
"lucene_version" : "8.7.0",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
The Elasticsearch service is now accessible. You can also interact with it from any browser at http://localhost:9200.
If you have R installed on your computer, simply use it with a kibior instance pointing at localhost:9200. Since this is the default configuration, the constructor needs no connection arguments.
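As a minimal sketch of that default connection, the constructor can be wrapped in a small function. The helper name connect_local is hypothetical (it is not part of kibior), and the host and port arguments are assumed to match the $host and $port attributes of the kibior class.

```r
# Hypothetical helper, not part of kibior: wraps the constructor call.
# Assumes the kibior package is installed and exports the `kibior` class
# generator used throughout this vignette.
connect_local <- function(host = "localhost", port = 9200) {
  # With the defaults, this is assumed to be equivalent to kibior$new()
  kibior::kibior$new(host = host, port = port)
}

# Usage, once the Elasticsearch container is up:
# kc <- connect_local()
```

Nothing is contacted when the function is defined; the connection is only attempted when connect_local() is called.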
If you do not have R installed on your computer, you can run it in a container with Docker and docker-compose.
The following sections guide you through using the R cli or the RStudio container. Both have kibior and its dependencies installed, but you can choose to use a clean R environment instead (i.e. rocker containers).
Steps:
- Uncomment the r4 service in the single-es.yml file.
# run services (daemonized)
➜ docker-compose -f single-es.yml up -d
elasticsearch is up-to-date
Creating r4 ... done
# see the current docker processes
➜ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0f1afd07f58a roncar/kibior-env:4.0.3 "/bin/bash" 4 minutes ago Up 4 minutes r4
40814036d980 docker.elastic.co/elasticsearch/elasticsearch:7.10.2 "/tini -- /usr/local…" 4 minutes ago Up 4 minutes 0.0.0.0:9200->9200/tcp, 9300/tcp elasticsearch
# open an interactive bash inside the R container (see previous command container ID)
➜ docker exec -it 0f1afd07f58a bash
# inside the R container, query the ES container (with its container name)
root@0f1afd07f58a:/$ curl -X GET "http://elasticsearch:9200"
{
"name" : "20f2383b909a",
"cluster_name" : "docker-cluster",
"cluster_uuid" : "InZqVTNiTK6idAWrEweWDg",
"version" : {
"number" : "7.10.2",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "747e1cc71def077253878a59143c1f785afa92b9",
"build_date" : "2021-01-13T00:42:12.435326Z",
"build_snapshot" : false,
"lucene_version" : "8.7.0",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
# inside the R container, run R cli
root@0f1afd07f58a:/$ R --vanilla
R version 4.0.3 (2020-10-10) -- "Bunny-Wunnies Freak Out"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> library(kibior)
# Here you can directly load kibior as it is pre-installed inside the container.
This container comes with R version 4.0.3 and the kibior package and its dependencies pre-installed. If you need a clean container with only R, you can use the rocker/r-ver:4.0.3 image instead.
Steps:
- Uncomment the rstudio4 service in the single-es.yml file.
# run services (daemonized)
➜ docker-compose -f single-es.yml up -d
elasticsearch is up-to-date
Creating rstudio4 ... done
# see the current docker processes
➜ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
62344a365b70 roncar/kibior-rstudio:4.0.3 "/init" 7 seconds ago Up 5 seconds 0.0.0.0:8787->8787/tcp rstudio4
111ebcf0d5c4 docker.elastic.co/elasticsearch/elasticsearch:7.10.2 "/tini -- /usr/local…" 7 seconds ago Up 5 seconds 0.0.0.0:9200->9200/tcp, 9300/tcp elasticsearch
Connect with your web browser at localhost:8787 with the login/password that were configured in the single-es.yml file.
This container comes with RStudio, R version 4.0.3, and the kibior package and its dependencies pre-installed. If you need a clean container with only RStudio, you can use the rocker/rstudio:4.0.3 image instead.
You can use several types of initialization:
#> Initiate a remote connection
kc_remote <- kibior$new(host = "something-far", user = "foo", pwd = "bar")
#> Create a new local instance bound to your local Elasticsearch
#> By default, kibior uses the localhost instance on port 9200
kc_local <- kibior$new()
#> You may need to authenticate if your Elasticsearch uses an auth system
#> The default login/password is "elastic"/"changeme", so
kc_local <- kibior$new(user = "elastic", pwd = "changeme")
#> You can now use `kc_local` as your own private instance.
To stop the service, simply enter the command:
# stop all services
➜ docker-compose -f single-es.yml down
Stopping r4 ... done
Stopping elasticsearch ... done
Removing r4 ... done
Removing elasticsearch ... done
Removing network docker_kibior_test_kibiornet
Here, we will see the main methods (push(), pull(), list(), columns(), keys(), has(), match(), export(), import(), move(), copy()) and public attributes (verbosity) of the kibior class. kibior uses elastic (Chamberlain 2020) to perform base functions.
By default, kibior comes with three public attributes: $verbose, $quiet_progress and $quiet_results, all initialized to FALSE.
- $verbose toggles the printing of additional information, which can be useful to see all process steps.
- $quiet_progress toggles the printing of progress bars. This can be useful for scripts.
- $quiet_results toggles the printing of the results of called methods. You may want to deactivate it when you do not need interactive feedback.
To quickly show them, simply print the instance you are using:
kc
## KibioR client:
## - host: elasticsearch
## - port: 9200
## - verbose: no
## - print result: yes
## - print progressbar: yes
Use kc$<attribute-name> <- TRUE/FALSE to toggle any of these three attributes.
A new instance of kibior defaults to interactive behavior: progress bars and results are printed immediately, but no additional information is shown.
See Attribute access in Advanced usage section for all attribute descriptions.
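For scripts, the two quiet attributes can be set together. The sketch below assumes a connected instance named kc; the helper name set_script_mode is hypothetical and not part of kibior.

```r
# Hypothetical helper, not part of kibior: silence progress bars and
# result printing in one call. `kc` is assumed to be a kibior instance
# (an R6 object, so it is modified in place).
set_script_mode <- function(kc, quiet = TRUE) {
  kc$quiet_progress <- quiet
  kc$quiet_results <- quiet
  invisible(kc)
}

# Usage:
# set_script_mode(kc)          # quiet, for batch scripts
# set_script_mode(kc, FALSE)   # back to interactive feedback
```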
$push(): Store a dataset to Elasticsearch
To store data using the kc connection:
kc$push(dplyr::storms, "storms")
## [1] "storms"
# or magrittr style
dplyr::starwars %>% kc$push("starwars")
## [1] "starwars"
If not already taken, the given index name will be created automatically before receiving data. If already taken, an error is raised.
Important points:
- $push() automatically sends data to the Elasticsearch server, which needs unique IDs. You can define your own IDs using the id_col parameter, which requires the name of a column that has unique elements.
  - If not defined, kibior will add a kid column, a counter used as unique IDs (default).
- $push() expects well-formatted data, mainly in a data.frame or derivative structure such as a tibble.
See Push modes in Advanced usage section for more information.
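Since id_col requires unique values, it can help to check the column before sending anything. This is a minimal sketch assuming a connected instance kc; the helper name push_with_ids and its checks are illustrative and not part of kibior.

```r
# Hypothetical helper, not part of kibior: validate the ID column locally,
# then push with `id_col` as described above.
push_with_ids <- function(kc, data, index_name, id_col) {
  stopifnot(is.data.frame(data), id_col %in% names(data))
  ids <- data[[id_col]]
  # Elasticsearch needs one unique ID per document
  if (anyNA(ids) || anyDuplicated(ids) > 0) {
    stop("Column '", id_col, "' must contain unique, non-missing values.")
  }
  kc$push(data, index_name, id_col = id_col)
}

# Usage (the `name` column of dplyr::starwars is assumed unique here):
# push_with_ids(kc, dplyr::starwars, "starwars_by_name", id_col = "name")
```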
$pull(): Download a dataset from Elasticsearch
The $pull() method downloads datasets. It can retrieve all or part of a dataset.
Results are stored in a list of tibbles.
#> pull the "storms" index; the result is a list of tibbles named by index
s <- kc$pull("storms")
s$storms
## # A tibble: 10,010 x 14
## name year month day hour lat long status category wind pressure
## <chr> <int> <int> <int> <int> <dbl> <dbl> <chr> <chr> <int> <int>
## 1 Ike 2008 9 7 18 21 -74 hurri… 3 105 946
## 2 Ike 2008 9 8 0 21.1 -75.2 hurri… 4 115 945
## 3 Ike 2008 9 8 2 21.1 -75.7 hurri… 4 115 945
## 4 Ike 2008 9 8 6 21.1 -76.5 hurri… 3 100 950
## 5 Ike 2008 9 8 12 21.1 -77.8 hurri… 2 85 960
## 6 Ike 2008 9 8 18 21.2 -79.1 hurri… 1 75 964
## 7 Ike 2008 9 9 0 21.5 -80.3 hurri… 1 70 965
## 8 Ike 2008 9 9 6 22 -81.4 hurri… 1 70 965
## 9 Ike 2008 9 9 12 22.4 -82.4 hurri… 1 70 965
## 10 Ike 2008 9 9 14 22.6 -82.9 hurri… 1 70 965
## # … with 10,000 more rows, and 3 more variables: ts_diameter <dbl>,
## # hu_diameter <dbl>, kid <int>
With this, we can use search patterns to return multiple indices at once.
See Pattern search in Advanced usage section for more information.
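Because $pull() returns a list even when a single index is requested, a small wrapper can extract the tibble directly. This is a sketch assuming a connected instance kc; the helper name pull_single is hypothetical.

```r
# Hypothetical helper, not part of kibior: pull one index and return its
# tibble instead of a one-element list. `query` is optional.
pull_single <- function(kc, index_name, query = NULL) {
  res <- if (is.null(query)) {
    kc$pull(index_name)
  } else {
    kc$pull(index_name, query = query)
  }
  res[[index_name]]
}

# Usage:
# storms <- pull_single(kc, "storms")
# anita  <- pull_single(kc, "storms", query = "name:anita")
```

The helper expects an exact index name: with a pattern such as "s*", the returned list is named after the matching indices, so the pattern itself is not a valid element name.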
$list(): List all Elasticsearch indices
$columns(): List all columns of an Elasticsearch index
$count(): Count the number of elements
#> count all lines
kc$count("storms")
## $storms
## [1] 10010
#> count all columns
kc$count("storms", type = "variables")
## $storms
## [1] 14
#> count all indices lines via a pattern
kc$count("s*")
## $starwars
## [1] 87
##
## $storms
## [1] 10010
Like $search() and $pull(), this method accepts a query parameter to count the number of hits in your dataset matching a query. See Querying in Advanced usage section for more information.
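As a sketch of how $list() and $count() combine, the function below builds a small overview of every index on the server. It assumes a connected instance kc, that $list() returns a character vector of index names, and that $count() returns a list named by index, as in the outputs above; the helper name index_sizes is hypothetical.

```r
# Hypothetical helper, not part of kibior: one row per index, with its
# number of documents (rows) and fields (columns).
index_sizes <- function(kc) {
  indices <- kc$list()
  if (length(indices) == 0) {
    return(data.frame(index = character(0), rows = numeric(0), columns = numeric(0)))
  }
  data.frame(
    index = indices,
    rows = vapply(indices, function(i) as.numeric(kc$count(i)[[i]]), numeric(1)),
    columns = vapply(indices, function(i) {
      as.numeric(kc$count(i, type = "variables")[[i]])
    }, numeric(1)),
    row.names = NULL
  )
}

# Usage:
# index_sizes(kc)
```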
$keys(): List all unique keys of an Elasticsearch index column
#> list all keys on integer column
kc$keys("storms", "year")
## [1] 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997
## [24] 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
#> list all keys on string column
kc$keys("storms", "status")
## [1] "hurricane" "tropical depression" "tropical storm"
You should not use this on columns that can represent a continuous range, such as temperature or coordinates. It will aggregate all possible values, which can take a large amount of time if your dataset is big enough.
$has(): Test if an Elasticsearch index exists
$match(): Select matching Elasticsearch indices
#> get exact matching indices
kc$match("storms")
## [1] "storms"
kc$match("abcde")
## NULL
#> get matching pattern indices
kc$match("s*")
## [1] "starwars" "storms"
#> get list of mixed pattern and non pattern matching indices
c("s*", "abcde") %>% kc$match()
## [1] "starwars" "storms"
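Since $match() returns NULL when nothing matches, code that loops over the result may prefer an empty character vector. This is a minimal sketch assuming a connected instance kc; the helper name resolve_indices is hypothetical.

```r
# Hypothetical helper, not part of kibior: expand names and patterns into
# existing index names, always returning a character vector.
resolve_indices <- function(kc, patterns) {
  found <- kc$match(patterns)
  if (is.null(found)) character(0) else found
}

# Usage:
# for (idx in resolve_indices(kc, c("s*", "abcde"))) message("found: ", idx)
```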
$match() and $has() differ on some points:
- $has() returns TRUE or FALSE for any string passed.
- $has() does not accept patterns and only checks whether the given strings are in $list().
- $match() only returns something if some indices match the given strings.
- $match() accepts patterns and unpacks all possible indices matching the given strings.
$export(): Extract Elasticsearch index content to a file
The $export() method creates a file and exports an in-memory dataset or an Elasticsearch index to this file.
#> Create temp files with data
storms_memory_tmp <- tempfile(fileext=".csv")
storms_elastic_tmp <- tempfile(fileext=".csv")
#> export an in-memory dataset to a file
dplyr::storms %>% kc$export(data = ., filepath = storms_memory_tmp)
## [1] "/tmp/RtmpVAwsWi/file243436451ae3.csv"
kc$import(storms_memory_tmp) %>% tibble::as_tibble()
## # A tibble: 10,010 x 13
## name year month day hour lat long status category wind pressure
## <chr> <int> <int> <int> <int> <dbl> <dbl> <chr> <int> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropi… -1 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropi… -1 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropi… -1 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropi… -1 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropi… -1 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropi… -1 25 1012
## 7 Amy 1975 6 28 12 33.3 -78 tropi… -1 25 1011
## 8 Amy 1975 6 28 18 34 -77 tropi… -1 30 1006
## 9 Amy 1975 6 29 0 34.4 -75.8 tropi… 0 35 1004
## 10 Amy 1975 6 29 6 34 -74.8 tropi… 0 40 1002
## # … with 10,000 more rows, and 2 more variables: ts_diameter <dbl>,
## # hu_diameter <dbl>
#> export an Elasticsearch index to a file
"storms" %>% kc$export(data = ., filepath = storms_elastic_tmp)
## [1] "/tmp/RtmpVAwsWi/file24343220815.csv"
kc$import(storms_elastic_tmp) %>% tibble::as_tibble()
## # A tibble: 10,010 x 14
## name year month day hour lat long status category wind pressure
## <chr> <int> <int> <int> <int> <dbl> <dbl> <chr> <int> <int> <int>
## 1 Ike 2008 9 7 18 21 -74 hurri… 3 105 946
## 2 Ike 2008 9 8 0 21.1 -75.2 hurri… 4 115 945
## 3 Ike 2008 9 8 2 21.1 -75.7 hurri… 4 115 945
## 4 Ike 2008 9 8 6 21.1 -76.5 hurri… 3 100 950
## 5 Ike 2008 9 8 12 21.1 -77.8 hurri… 2 85 960
## 6 Ike 2008 9 8 18 21.2 -79.1 hurri… 1 75 964
## 7 Ike 2008 9 9 0 21.5 -80.3 hurri… 1 70 965
## 8 Ike 2008 9 9 6 22 -81.4 hurri… 1 70 965
## 9 Ike 2008 9 9 12 22.4 -82.4 hurri… 1 70 965
## 10 Ike 2008 9 9 14 22.6 -82.9 hurri… 1 70 965
## # … with 10,000 more rows, and 3 more variables: ts_diameter <dbl>,
## # hu_diameter <dbl>, kid <int>
This method can also zip the exported file automatically when the file path ends with the .zip extension.
#> file with zip extension
storms_memory_zip <- tempfile(fileext=".csv.zip")
#> export it
dplyr::storms %>% kc$export(storms_memory_zip)
## [1] "/tmp/RtmpVAwsWi/file243412667717.csv.zip"
Note: kibior uses rio (Chan et al. 2018), which can export many more formats. See the rio documentation and the rio::install_formats() function.
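A quick way to check an export is to read the file back and compare row counts. The sketch below assumes a connected instance kc and uses only the data and filepath arguments shown above; the helper name export_and_check is hypothetical.

```r
# Hypothetical helper, not part of kibior: export a dataset, re-import the
# file, and report whether the number of rows survived the round trip.
export_and_check <- function(kc, data, filepath) {
  kc$export(data = data, filepath = filepath)
  reread <- kc$import(filepath = filepath)
  nrow(reread) == nrow(data)
}

# Usage:
# export_and_check(kc, dplyr::storms, tempfile(fileext = ".csv"))
```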
$import(): Get a file content to a new Elasticsearch index
The $import() method can load a dataset from a file into an in-memory variable, a new Elasticsearch index, or both.
#> import data from file
kc$import(filepath = storms_memory_tmp)
## # A tibble: 10,010 x 13
## name year month day hour lat long status category wind pressure
## <chr> <int> <int> <int> <int> <dbl> <dbl> <chr> <int> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropi… -1 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropi… -1 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropi… -1 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropi… -1 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropi… -1 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropi… -1 25 1012
## 7 Amy 1975 6 28 12 33.3 -78 tropi… -1 25 1011
## 8 Amy 1975 6 28 18 34 -77 tropi… -1 30 1006
## 9 Amy 1975 6 29 0 34.4 -75.8 tropi… 0 35 1004
## 10 Amy 1975 6 29 6 34 -74.8 tropi… 0 40 1002
## # … with 10,000 more rows, and 2 more variables: ts_diameter <dbl>,
## # hu_diameter <dbl>
#> import data from file and send it to a new
#> Elasticsearch index, with default configuration
kc$import(filepath = storms_memory_tmp,
push_index = "storms_file",
push_mode = "recreate")
## # A tibble: 10,010 x 14
## name year month day hour lat long status category wind pressure
## <chr> <int> <int> <int> <int> <dbl> <dbl> <chr> <int> <int> <int>
## 1 Sean 2011 11 10 18 30.5 -70 tropi… 0 55 983
## 2 Sean 2011 11 11 0 31 -69 tropi… 0 55 984
## 3 Sean 2011 11 11 6 32.2 -67.2 tropi… 0 50 987
## 4 Sean 2011 11 11 12 33.4 -65.3 tropi… 0 45 991
## 5 Sean 2011 11 11 18 34.8 -62.6 tropi… 0 40 995
## 6 Albe… 2012 5 19 6 32.8 -77.1 tropi… -1 30 1008
## 7 Albe… 2012 5 19 12 32.5 -77.3 tropi… 0 40 1005
## 8 Albe… 2012 5 19 18 32.3 -77.6 tropi… 0 45 997
## 9 Albe… 2012 5 20 0 32.1 -78.1 tropi… 0 50 995
## 10 Albe… 2012 5 20 6 31.9 -78.7 tropi… 0 45 998
## # … with 10,000 more rows, and 3 more variables: ts_diameter <dbl>,
## # hu_diameter <dbl>, kid <int>
kc$list()
## [1] "starwars" "storms" "storms_file"
Like $export(), it can also read directly from zipped files.
#> import data directly from a zipped file
kc$import(storms_memory_zip)
## # A tibble: 10,010 x 13
## name year month day hour lat long status category wind pressure
## <chr> <int> <int> <int> <int> <dbl> <dbl> <chr> <int> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropi… -1 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropi… -1 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropi… -1 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropi… -1 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropi… -1 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropi… -1 25 1012
## 7 Amy 1975 6 28 12 33.3 -78 tropi… -1 25 1011
## 8 Amy 1975 6 28 18 34 -77 tropi… -1 30 1006
## 9 Amy 1975 6 29 0 34.4 -75.8 tropi… 0 35 1004
## 10 Amy 1975 6 29 6 34 -74.8 tropi… 0 40 1002
## # … with 10,000 more rows, and 2 more variables: ts_diameter <dbl>,
## # hu_diameter <dbl>
Note: kibior uses rio (Chan et al. 2018), which can import many more formats. See the rio documentation and the rio::install_formats() function.
The $import() method can natively manage sequence, alignment and feature formats (e.g. fasta, bam, gtf, gff, bed, etc.) since it also wraps Bioconductor library methods such as rtracklayer::import() (Lawrence, Gentleman, and Carey 2019), Biostrings::read*StringSet() (Pagès et al. 2020) and Rsamtools::scanBam() (Morgan et al. 2020).
Dedicated methods are implemented inside kibior (e.g. $import_features() and $import_alignments()), and the generic $import() method tries to open the right format according to the file extension. You can also use the specific methods if the format cannot be guessed by the generic $import() method: $import_sequences(), $import_alignments(), $import_features(), $import_tabular() and $import_json().
$move(): Rename an index
The $move() method renames an index. The $copy() method is equivalent to $move(copy = TRUE).
$copy(): Copy an index
The $copy() method copies an index to another name. It is a wrapper around $move(copy = TRUE).
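As an illustration of $copy(), the sketch below duplicates an index under a suffixed name. It assumes a connected instance kc and that $copy() takes the source index name first and the destination name second; that argument order, like the helper name backup_index, is an assumption to check against the method documentation.

```r
# Hypothetical helper, not part of kibior: copy an index to
# "<index_name><suffix>" and return the new name.
# Assumption: kc$copy(<source>, <destination>) argument order.
backup_index <- function(kc, index_name, suffix = "_backup") {
  target <- paste0(index_name, suffix)
  kc$copy(index_name, target)
  target
}

# Usage:
# backup_index(kc, "storms")   # creates "storms_backup"
```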
$delete(): Delete an Elasticsearch index
The $delete() method deletes one or more indices.
#> delete one or multiple indices
c("storms_file", "storms_file_moved") %>% kc$delete()
## $storms_file
## [1] TRUE
##
## $storms_file_moved
## [1] TRUE
It can also delete following a pattern.
#> push some subsets with the same prefix
push_storm <- function(storm_name, index_name){
  dplyr::storms %>%
    dplyr::filter(name == storm_name) %>%
    kc$push(index_name)
}
push_storm("Amy", "storms_amy")
## [1] "storms_amy"
push_storm("Doris", "storms_doris")
## [1] "storms_doris"
push_storm("Bess", "storms_bess")
## [1] "storms_bess"
#> list
kc$list()
## [1] "starwars" "storms_bess" "storms" "storms_doris" "storms_amy"
#> delete following a pattern
kc$delete("storms_*")
## $storms_amy
## [1] TRUE
##
## $storms_doris
## [1] TRUE
##
## $storms_bess
## [1] TRUE
kc$list()
## [1] "starwars" "storms"
$search(): Search everything
Elasticsearch is here… You know, for search. As a search engine, this is its main feature.
Using the $search() method, you can search for anything inside part or all of the data indexed by Elasticsearch. If no restriction is found in the query parameter, all data will be searched, which means every index, every column, every keyword.
#> here, we search the exact string "something" everywhere
#> but will find nothing
kc$search(query = "something")
## $starwars
## list()
##
## $storms
## list()
#> we search for the exact string "anita" in "storms" dataset
kc$search("storms", query = "anita")[["storms"]]
## # A tibble: 5 x 14
## name year month day hour lat long status category wind pressure
## <chr> <int> <int> <int> <int> <dbl> <dbl> <chr> <chr> <int> <int>
## 1 Anita 1977 8 29 12 26.9 -88.4 tropi… -1 20 1012
## 2 Anita 1977 8 29 18 27 -88.9 tropi… -1 25 1010
## 3 Anita 1977 8 30 0 26.9 -89.4 tropi… -1 30 1009
## 4 Anita 1977 8 30 6 26.8 -89.8 tropi… 0 40 1006
## 5 Anita 1977 8 30 12 26.7 -90.3 tropi… 0 50 1003
## # … with 3 more variables: ts_diameter <lgl>, hu_diameter <lgl>, kid <int>
#> we search for text containing the substring "am" in "storms" dataset
kc$pull("storms", query = "*am*")[["storms"]]$name %>% unique
## [1] "Tammy" "Gamma" "Amy" "Amelia"
By default, $search() has head mode active, which returns a small subset (default is 5) of the complete result to allow quick inspection of data. With $verbose <- TRUE, it will be printed in the result as “Head mode: on”. To change the head size, modify the $head_search_size attribute.
To get the full result, you have to use $search(head = FALSE), or more simply: $pull().
See Querying in Advanced usage section for more information.
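The head size can also be changed temporarily for a single search. This is a minimal sketch assuming a connected instance kc; the helper name preview is hypothetical and only relies on the $head_search_size attribute and the query parameter described above.

```r
# Hypothetical helper, not part of kibior: run a head-mode search with a
# custom head size, then restore the previous value.
preview <- function(kc, index_name, query, n = 10) {
  old <- kc$head_search_size
  on.exit(kc$head_search_size <- old)   # restore even if the search fails
  kc$head_search_size <- n
  kc$search(index_name, query = query)[[index_name]]
}

# Usage:
# preview(kc, "storms", query = "name:anita", n = 3)
```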
$stats(): Base statistics of columns
Alongside data handling methods are descriptive statistical methods. You already know $count(), but here are some others provided by kibior.
The $stats() method is a shortcut to ask for: count, min, max, avg, sum, sum_of_squares, variance, std_deviation, std_deviation_upper (bound), std_deviation_lower (bound).
#> multi-indices, index pattern and multicolumns
kc$stats(c("starwars", "s*"), c("height", "mass"))
## $starwars
## # A tibble: 2 x 11
## column count min max avg sum sum_of_squares variance std_deviation std_deviation_bounds_… std_deviation_bounds…
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 height 81 66 264 174. 14123 2559177 1194. 34.6 243. 105.
## 2 mass 59 15 1358 97.3 5741. 2224219. 28229. 168. 433. -239.
##
## $storms
## list()
#> work also with query and sigma for standard deviation
kc$stats("starwars", c("height", "mass"), sigma = 2.5, query = "homeworld:naboo")
## $starwars
## # A tibble: 2 x 11
## column count min max avg sum sum_of_squares variance std_deviation std_deviation_bounds_… std_deviation_bounds_…
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 height 11 96 224 175. 1930 349446 984. 31.4 254. 97.1
## 2 mass 6 32 85 64.2 385 26979 379. 19.5 113. 15.5
Some important warnings here:
In addition to $count() and $stats(), many other methods exist to perform descriptive analysis: avg, mean, min, max, sum, q1, q2, median, q3 and summary.
$describe_index() and $describe_columns(): Get the description of index and columns
You can ask for the description of datasets with these methods.
Important: this feature requires the user who pushed the data to manually add the metadata with $add_description().
Some methods, such as $search() and $pull(), allow the use of the "*" wildcard.
#> consider these two datasets
dplyr::starwars %>% kc$push("starwars", mode = "recreate")
## [1] "starwars"
dplyr::storms %>% kc$push("storms", mode = "recreate")
## [1] "storms"
#> We want to search all indices starting with an "s"
#> We search for words in the "name" field that start with a "d"
#> Both "starwars" and "storms" indices have a "name" field
s <- kc$search("s*", query = "name:d*", head = FALSE)
s %>% names()
## [1] "starwars" "storms"
s$starwars
## # A tibble: 11 x 14
## name height mass hair_color skin_color eye_color birth_year gender
## <chr> <int> <int> <chr> <chr> <chr> <dbl> <chr>
## 1 R2-D2 96 32 <NA> white, bl… red 33 <NA>
## 2 Dart… 202 136 none white yellow 41.9 male
## 3 R5-D4 97 32 <NA> white, red red NA <NA>
## 4 Bigg… 183 84 black light brown 24 male
## 5 Jabb… 175 1358 <NA> green-tan… orange 600 herma…
## 6 Dart… 175 80 none red yellow 54 male
## 7 Dud … 94 45 none blue, grey yellow NA male
## 8 Dormé 165 NA brown light brown NA female
## 9 Dooku 193 80 white fair brown 102 male
## 10 Dext… 198 102 none brown yellow NA male
## 11 Poe … NA NA brown light brown NA male
## # … with 6 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <chr>, starships <chr>, kid <int>
s$storms
## # A tibble: 722 x 14
## name year month day hour lat long status category wind pressure
## <chr> <int> <int> <int> <int> <dbl> <dbl> <chr> <chr> <int> <int>
## 1 Debby 1988 9 6 6 21.5 -107. tropi… -1 30 1005
## 2 Debby 1988 9 6 12 22 -107. tropi… -1 25 1005
## 3 Debby 1988 9 6 18 22.5 -108. tropi… -1 25 1006
## 4 Debby 1988 9 7 0 23 -108 tropi… -1 25 1006
## 5 Debby 1988 9 7 6 23.5 -108. tropi… -1 25 1007
## 6 Debby 1988 9 7 12 23.9 -108. tropi… -1 25 1007
## 7 Debby 1988 9 7 18 24.2 -109. tropi… -1 25 1007
## 8 Debby 1988 9 8 0 24.4 -109. tropi… -1 25 1008
## 9 Debby 1988 9 8 6 24.3 -109. tropi… -1 20 1008
## 10 Debby 1988 9 8 12 24.2 -109. tropi… -1 20 1008
## # … with 712 more rows, and 3 more variables: ts_diameter <dbl>,
## # hu_diameter <dbl>, kid <int>
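To build intuition for how a pattern such as "s*" selects indices, the matching can be mimicked locally in base R. This is only a sketch: the index names below are hypothetical, and Elasticsearch resolves wildcards on the server side, not with `glob2rx()`.

```r
# Hypothetical index names, as kc$list() might return them
indices <- c("diamonds", "starwars", "storms", "storms_zeta")

# Translate the glob-like pattern into a regular expression (base R, utils)
pattern <- utils::glob2rx("s*")

# Keep only the names matching the pattern
matched <- indices[grepl(pattern, indices)]
matched
```

Only the names starting with "s" are kept, which mirrors the set of indices that the wildcard search above ran against.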
As objects, kibior instances have attributes that can be accessed and, for some of them, updated.
Attribute name | Read-only | Default | Description |
---|---|---|---|
$host | | “localhost” | the Elasticsearch host |
$port | | 9200 | the Elasticsearch port |
$user | x | NULL | the Elasticsearch user |
$pwd | x | NULL | the Elasticsearch password |
$connection | x | NULL | the Elasticsearch connection object |
$head_search_size | | 5 | the head size default value |
$cluster_name | x | When connected | the cluster name if and only if already connected |
$cluster_status | x | When connected | the cluster status if and only if already connected |
$nb_documents | x | When connected | the current cluster total number of documents if already connected |
$version | x | When connected | the Elasticsearch version if and only if already connected |
$elastic_wait | | 2 | the Elasticsearch wait time for update commands if already connected (in seconds) |
$valid_joins | x | A vector | the valid joins available in kibior |
$valid_count_types | x | A vector | the valid count types available (mainly observations = rows, variables = columns) |
$valid_elastic_metadata_types | x | A vector | the valid Elasticsearch metadata types available |
$valid_push_modes | x | A vector | the valid push modes available |
$shard_number | | 1 | the number of allocated primary shards when creating an Elasticsearch index |
$shard_replicas_number | | 1 | the number of allocated replicas in an Elasticsearch index |
$default_id_col | | “kid” | the ID column name used when sending data to Elasticsearch if not provided by user |
$verbose | | FALSE | the verbose mode |
$quiet_progress | | FALSE | the progress bar printing mode |
$quiet_results | | FALSE | the method results printing mode |
#> access the current host for the "kc" instance
kc$host
## [1] "elasticsearch"
#> modify the head_search threshold
kc$head_search_size <- 10L
Some attributes cannot be modified.
Working alone directly on a massive cluster of servers is an unlikely situation. Moreover, handling large datasets on your own computer or storing all data in your local Elasticsearch repository is generally a bad idea. We tend to handle only what we can afford to, and organize pipelines and software accordingly.
There are multiple strategies to organize data, and our main objective here is to use servers for what they were built for: doing the CPU- and memory-greedy work. In comparison, our personal computers or laptops will not carry heavy processing loads. Putting kibior in this equation helps further, as it is backed by a database and search engine.
As a rule of thumb, subsetting and querying is a good strategy, e.g. splitting on categorical variables.
#> push storms dataset
dplyr::storms %>%
kc$push("storms", mode = "recreate")
## [1] "storms"
#> select the first 6 storm names (the `head()` default) and push them
#> in different indices, each name prefixed with "storms_"
dplyr::storms %>%
split(dplyr::storms$name) %>%
head() %>%
purrr::imap(function(data, index_name){
index_name %>%
tolower() %>%
paste0("storms_", .) %>%
kc$push(data, .)
})
## $AL011993
## [1] "storms_al011993"
##
## $AL012000
## [1] "storms_al012000"
##
## $AL021992
## [1] "storms_al021992"
##
## $AL021994
## [1] "storms_al021994"
##
## $AL021999
## [1] "storms_al021999"
##
## $AL022000
## [1] "storms_al022000"
kc$list()
## [1] "starwars" "storms" "storms_al011993" "storms_al012000"
## [5] "storms_al021992" "storms_al021994" "storms_al021999" "storms_al022000"
We can then search in all indices whose names start with the prefix “storms_”:
#> Within them, we search some minimum winds and pressure
#> results come already filtered by storm names
kc$search("storms_*",
query = "wind:>25 && pressure:>30",
columns = c("name", "year", "month", "lat", "long", "status"),
head = FALSE)
## $storms_al011993
## # A tibble: 4 x 6
## month year name lat long status
## <int> <int> <chr> <dbl> <dbl> <chr>
## 1 6 1993 AL011993 25.4 -77.5 tropical depression
## 2 6 1993 AL011993 26.1 -75.8 tropical depression
## 3 6 1993 AL011993 26.7 -74 tropical depression
## 4 6 1993 AL011993 27.8 -71.8 tropical depression
##
## $storms_al021992
## # A tibble: 4 x 6
## month year name lat long status
## <int> <int> <chr> <dbl> <dbl> <chr>
## 1 6 1992 AL021992 25.7 -85.5 tropical depression
## 2 6 1992 AL021992 27 -84.5 tropical depression
## 3 6 1992 AL021992 27.6 -84 tropical depression
## 4 6 1992 AL021992 28.5 -82.9 tropical depression
##
## $storms_al022000
## # A tibble: 10 x 6
## month year name lat long status
## <int> <int> <chr> <dbl> <dbl> <chr>
## 1 6 2000 AL022000 9.6 -21 tropical depression
## 2 6 2000 AL022000 9.9 -22.6 tropical depression
## 3 6 2000 AL022000 10.2 -24.5 tropical depression
## 4 6 2000 AL022000 10.1 -26.2 tropical depression
## 5 6 2000 AL022000 9.9 -27.8 tropical depression
## 6 6 2000 AL022000 9.9 -29.3 tropical depression
## 7 6 2000 AL022000 10.1 -30.1 tropical depression
## 8 6 2000 AL022000 10.1 -32.6 tropical depression
## 9 6 2000 AL022000 10 -34.2 tropical depression
## 10 6 2000 AL022000 9.8 -36.2 tropical depression
##
## $storms_al021994
## # A tibble: 2 x 6
## month year name lat long status
## <int> <int> <chr> <dbl> <dbl> <chr>
## 1 7 1994 AL021994 33 -79.1 tropical depression
## 2 7 1994 AL021994 33.2 -79.2 tropical depression
##
## $storms_al021999
## # A tibble: 3 x 6
## month year name lat long status
## <int> <int> <chr> <dbl> <dbl> <chr>
## 1 7 1999 AL021999 20.2 -95 tropical depression
## 2 7 1999 AL021999 20.6 -96.3 tropical depression
## 3 7 1999 AL021999 20.5 -97 tropical depression
##
## $storms_al012000
## list()
As shown above, we did not push all data but only some subsets of interest. By selecting and pushing only what we need, datasets can be searched and shared immediately afterwards.
If you work in sync with multiple remote collaborators on the same Elasticsearch cluster, that can be a great strategy. For instance, one of your collaborators can add a new dataset that will not change the request, but will enrich the result.
#> added from remote kibior instance
#> using `tail()` to simulate other data
dplyr::storms %>%
split(dplyr::storms$name) %>%
tail(2) %>%
purrr::imap(function(data, index_name){
index_name %>%
tolower() %>%
paste0("storms_", .) %>%
kc$push(data, .)
})
## $Wilma
## [1] storms_wilma
##
## $Zeta
## [1] storms_zeta
We can apply the same request and find some new results.
#> search all, same request as before
s <- kc$search("storms_*",
query = "wind:>25 && pressure:>30",
columns = c("name", "year", "month", "lat", "long", "status"),
head = FALSE)
#> assemble results if needed
do.call(rbind, s)
## # A tibble: 96 x 6
## month year name lat long status
## * <int> <int> <chr> <dbl> <dbl> <chr>
## 1 6 1993 AL011993 25.4 -77.5 tropical depression
## 2 6 1993 AL011993 26.1 -75.8 tropical depression
## 3 6 1993 AL011993 26.7 -74 tropical depression
## 4 6 1993 AL011993 27.8 -71.8 tropical depression
## 5 12 2005 Zeta 23.9 -35.6 tropical depression
## 6 12 2005 Zeta 24.2 -36.1 tropical storm
## 7 12 2005 Zeta 24.7 -36.6 tropical storm
## 8 12 2005 Zeta 25.2 -37 tropical storm
## 9 12 2005 Zeta 25.6 -37.3 tropical storm
## 10 12 2005 Zeta 25.7 -37.6 tropical storm
## # … with 86 more rows
One of the main features of kibior is the ability to search inside vast amounts of data thanks to Elasticsearch. You can use the search feature with the eponymous method $search(), but also with $pull() by using the query parameter.
To query specific data, the query parameter of methods such as $count() or $search() requires one string following the Elasticsearch Query String Syntax.
To sum them up, you can search for:
terms,
kc$search("starwars", query = "orange")$starwars
## # A tibble: 5 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <int> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <lis> <chr> <chr>
## 1 Jar … 196 66 none orange orange 52 male mascu… Naboo Gungan <chr… "" ""
## 2 Plo … 188 80 none orange black 22 male mascu… Dorin Kel Dor <chr… "" "Jedi st…
## 3 Jabb… 175 1358 NA green-tan… orange 600 herm… mascu… Nal Hutta Hutt <chr… "" ""
## 4 Ackb… 180 83 none brown mot… orange 41 male mascu… Mon Cala Mon Ca… <chr… "" ""
## 5 Roos… 224 82 none grey orange NA male mascu… Naboo Gungan <chr… "" ""
## # … with 1 more variable: kid <int>
phrases, with double-quotes.
kc$search("starwars", query = '"Luke Skywalker"')$starwars
## # A tibble: 1 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <int> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <lis> <list> <list>
## 1 Luke… 172 77 blond fair blue 19 male mascu… Tatooine Human <chr… <chr [2… <chr [2]>
## # … with 1 more variable: kid <int>
To complement, you can apply multiple operators:
boolean operators:
grouping: organize boolean operators, ex: “(quick OR brown) AND fox”.
field selecting: target a specific column.
#> rows that have "name" == "Luke Skywalker"
kc$search("starwars", query = 'name:"Luke Skywalker"')$starwars
## # A tibble: 1 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <int> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <lis> <list> <list>
## 1 Luke… 172 77 blond fair blue 19 male mascu… Tatooine Human <chr… <chr [2… <chr [2]>
## # … with 1 more variable: kid <int>
#> rows that have blue or green eyes
kc$search("starwars", query = 'eye_color:(blue OR green)')$starwars
## # A tibble: 5 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <int> <int> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis> <list> <list>
## 1 Grie… 216 159 none brown, wh… green, y… NA male mascu… Kalee Kaleesh <chr… <chr [1… <chr [1]>
## 2 Luke… 172 77 blond fair blue 19 male mascu… Tatooine Human <chr… <chr [2… <chr [2]>
## 3 Owen… 178 120 brown, gr… light blue 52 male mascu… Tatooine Human <chr… <chr [1… <chr [1]>
## 4 Beru… 165 75 brown light blue 47 fema… femin… Tatooine Human <chr… <chr [1… <chr [1]>
## 5 Anak… 188 84 blond fair blue 41.9 male mascu… Tatooine Human <chr… <chr [2… <chr [3]>
## # … with 1 more variable: kid <int>
range notation: using [min TO max] for inclusive or {min TO max} for exclusive.
n:>=10 is equivalent to n:[10 TO *].
n:<=10 is equivalent to n:[* TO 10].
n:>10 is equivalent to n:{10 TO *}.
n:<10 is equivalent to n:{* TO 10}.
#> include 160 and 180 values
kc$search("starwars", query = "height:[160 TO 180]")$starwars
## # A tibble: 5 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <int> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <lis> <list> <list>
## 1 Luke… 172 77 blond fair blue 19 male mascu… Tatooine Human <chr… <chr [2… <chr [2]>
## 2 C-3PO 167 75 NA gold yellow 112 none mascu… Tatooine Droid <chr… <chr [1… <chr [1]>
## 3 Owen… 178 120 brown, gr… light blue 52 male mascu… Tatooine Human <chr… <chr [1… <chr [1]>
## 4 Beru… 165 75 brown light blue 47 fema… femin… Tatooine Human <chr… <chr [1… <chr [1]>
## 5 Wilh… 180 NA auburn, g… fair blue 64 male mascu… Eriadu Human <chr… <chr [1… <chr [1]>
## # … with 1 more variable: kid <int>
#> exclude 160 and 180 values
kc$search("starwars", query = "height:{160 TO 180}")$starwars
## # A tibble: 5 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <int> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <lis> <list> <list>
## 1 Luke… 172 77 blond fair blue 19 male mascu… Tatooine Human <chr… <chr [2… <chr [2]>
## 2 C-3PO 167 75 NA gold yellow 112 none mascu… Tatooine Droid <chr… <chr [1… <chr [1]>
## 3 Owen… 178 120 brown, gr… light blue 52 male mascu… Tatooine Human <chr… <chr [1… <chr [1]>
## 4 Beru… 165 75 brown light blue 47 fema… femin… Tatooine Human <chr… <chr [1… <chr [1]>
## 5 Gree… 173 74 NA green black 44 male mascu… Rodia Rodian <chr… <chr [1… <chr [1]>
## # … with 1 more variable: kid <int>
#> exclude 160 but include 180
kc$search("starwars", query = "height:{160 TO 180]")$starwars
## # A tibble: 5 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <int> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <lis> <list> <list>
## 1 Luke… 172 77 blond fair blue 19 male mascu… Tatooine Human <chr… <chr [2… <chr [2]>
## 2 C-3PO 167 75 NA gold yellow 112 none mascu… Tatooine Droid <chr… <chr [1… <chr [1]>
## 3 Owen… 178 120 brown, gr… light blue 52 male mascu… Tatooine Human <chr… <chr [1… <chr [1]>
## 4 Beru… 165 75 brown light blue 47 fema… femin… Tatooine Human <chr… <chr [1… <chr [1]>
## 5 Wilh… 180 NA auburn, g… fair blue 64 male mascu… Eriadu Human <chr… <chr [1… <chr [1]>
## # … with 1 more variable: kid <int>
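The range notation shown above is easy to get wrong when brackets are typed by hand. As a minimal sketch, a small helper can assemble the clause from its parts. `es_range()` is a hypothetical function written for this illustration; it is not part of kibior.

```r
# Hypothetical helper (not part of kibior): build a range clause
# for the Elasticsearch query string syntax.
es_range <- function(field, min = "*", max = "*",
                     include_min = TRUE, include_max = TRUE) {
  open  <- if (include_min) "[" else "{"   # "[" includes the bound, "{" excludes it
  close <- if (include_max) "]" else "}"
  paste0(field, ":", open, min, " TO ", max, close)
}

both_included <- es_range("height", 160, 180)
min_excluded  <- es_range("height", 160, 180, include_min = FALSE)
open_ended    <- es_range("height", min = 180)
```

The resulting strings can be passed as the query parameter, e.g. `kc$search("starwars", query = both_included)`, assuming a connected `kc` instance as in the examples above.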
fuzziness and proximity: using “~” at the end of a term or phrase for approximate search.
terms, ex: “quikc~” and “quikc~2” are identical.
phrases, ex: “"fox quick"~5”.
#> fuzzy search for blue/black/brown/... eyes
#> useful when we do not know exactly the content
kc$search("starwars", query = "eye_color:bla~3")$starwars
## # A tibble: 5 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <int> <int> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis> <list> <list>
## 1 Luke… 172 77 blond fair blue 19 male mascu… Tatooine Human <chr… <chr [2… <chr [2]>
## 2 Owen… 178 120 brown, gr… light blue 52 male mascu… Tatooine Human <chr… <chr [1… <chr [1]>
## 3 Beru… 165 75 brown light blue 47 fema… femin… Tatooine Human <chr… <chr [1… <chr [1]>
## 4 Anak… 188 84 blond fair blue 41.9 male mascu… Tatooine Human <chr… <chr [2… <chr [3]>
## 5 Wilh… 180 NA auburn, g… fair blue 64 male mascu… Eriadu Human <chr… <chr [1… <chr [1]>
## # … with 1 more variable: kid <int>
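Fuzzy matching relies on an edit distance between the searched term and the indexed terms. The idea can be illustrated locally with `utils::adist()` from base R. This is only an approximation under stated assumptions: `adist()` computes a generalized Levenshtein distance, whereas Elasticsearch uses the Damerau-Levenshtein distance (a transposition counts as one edit), so the numbers may differ for swapped letters.

```r
# Candidate eye colours and the fuzzy term used in the query above
colours <- c("blue", "black", "brown")
term <- "bla"

# Generalized Levenshtein distance between the term and each candidate
distances <- drop(utils::adist(term, colours))
names(distances) <- colours
distances
```

A smaller distance means that fewer single-character edits are needed to turn the term into the candidate.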
boosting: using “^” to weight some expressions over others.
terms, ex: quick^2 fox, quick is boosted.
phrases, ex: "foo bar"^2.
groups, ex: (foo bar)^4.
#> boost the black eye search but get the blue too
kc$search("starwars", query = "eye_color:(black^2 OR blue)")$starwars
## # A tibble: 5 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <int> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <lis> <chr> <chr>
## 1 Gree… 173 74 NA green black 44 male mascu… Rodia Rodian <chr… "" ""
## 2 Nien… 160 68 none grey black NA male mascu… Sullust Sullus… <chr… "" "Millenn…
## 3 Gasg… 122 NA none white, bl… black NA male mascu… Troiken Xexto <chr… "" ""
## 4 Kit … 196 87 none green black NA male mascu… Glee Ans… Nautol… <chr… "" ""
## 5 Plo … 188 80 none orange black 22 male mascu… Dorin Kel Dor <chr… "" "Jedi st…
## # … with 1 more variable: kid <int>
Now, we can easily build a more complex search query:
#> consider this dataset
ggplot2::diamonds %>% kc$push("diamonds")
## [1] "diamonds"
#> searching premium or ideal quality of diamonds,
#> with a price below 10k$, a carat above 1.4,
#> a z between 2.2 and 5.4 included, and not colors E or H.
#> we only want some columns.
kc$search("diamonds",
query = "cut:(premium || ideal)
&& price:<10000
&& carat:>1.4
&& z:[2.2 TO 5.4]
&& -color:(E || H)",
columns = c("carat", "color", "depth", "clarity", "price", "z"),
head = FALSE)
## $diamonds
## # A tibble: 765 x 6
## depth color clarity price carat z
## <dbl> <chr> <chr> <int> <dbl> <dbl>
## 1 62.4 J SI1 8176 1.59 4.66
## 2 62.7 I SI1 8193 1.51 4.59
## 3 61.5 J VS2 8203 1.51 4.54
## 4 62 J VS1 8207 1.54 4.62
## 5 62 J VS2 8217 1.51 4.54
## 6 62.2 I SI2 8220 1.62 4.69
## 7 62.4 J SI1 8221 1.57 4.65
## 8 60.3 J VVS2 8227 1.59 4.59
## 9 62.6 I SI2 8228 1.57 4.63
## 10 62 I SI2 8254 1.54 4.56
## # … with 755 more rows
$search() behavior
#> consider this dataset
dplyr::storms %>% kc$push("storms", mode = "recreate")
## [1] "storms"
dplyr::starwars %>% kc$push("starwars", mode = "recreate")
## [1] "starwars"
Though Elasticsearch is very powerful as a document-oriented database, it is first and foremost a full-text search engine: a bare term matches whole words, not substrings.
#> searching for exact word "dar" but nothing found
kc$search(query = "dar")
## $diamonds
## list()
##
## $starwars
## list()
##
## $storms
## list()
With wildcard and targeting a single index:
#> The search is case-insensitive meaning:
#> Dar == dAr == daR == DAr == ...etc.
kc$search(query = "*Dar*")$starwars
## # A tibble: 5 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <int> <int> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <lis> <chr> <chr>
## 1 Dart… 202 136 none white yellow 41.9 male mascu… Tatooine Human <chr… "" "TIE Adv…
## 2 Bigg… 183 84 black light brown 24 male mascu… Tatooine Human <chr… "" "X-wing"
## 3 Land… 177 79 black dark brown 31 male mascu… Socorro Human <chr… "" "Millenn…
## 4 Watto 137 NA black blue, grey yellow NA male mascu… Toydaria Toydar… <chr… "" ""
## 5 Quar… 183 NA black dark brown 62 NA NA Naboo NA <chr… "" ""
## # … with 1 more variable: kid <int>
Column selection:
#> searching every word in name that starts with "d"
kc$search("*",
query = "name:d*",
columns = c("name", "status"))
## $diamonds
## list()
##
## $starwars
## # A tibble: 5 x 1
## name
## <chr>
## 1 R2-D2
## 2 Darth Vader
## 3 R5-D4
## 4 Biggs Darklighter
## 5 Jabba Desilijic Tiure
##
## $storms
## # A tibble: 5 x 2
## name status
## <chr> <chr>
## 1 Debby tropical depression
## 2 Debby tropical depression
## 3 Debby tropical depression
## 4 Debby tropical depression
## 5 Debby tropical depression
As you can see in the last request, columns that do not exist in an index are simply not returned for it.
Now a more complex search, directly done by pulling data:
#> We can search premium or ideal quality of diamonds,
#> with a price below 10k$, a carat above 1.4,
#> a z between 2.2 and 5.4 included, not colors E or H,
#> and not from a clarity starting with the string "VS"
#> we only want some columns.
kc$pull("diamonds",
query = "cut:(premium || ideal)
&& price:<10000
&& carat:>1.4
&& z:[2.2 TO 5.4]
&& -color:(E || H)
&& -clarity:VS*",
columns = c("carat", "color", "depth", "clarity", "price", "z"))
## $diamonds
## # A tibble: 552 x 6
## depth color clarity price carat z
## <dbl> <chr> <chr> <int> <dbl> <dbl>
## 1 62.8 I SI1 8574 1.51 4.58
## 2 61.4 G SI2 8580 1.5 4.52
## 3 62 G SI2 8580 1.5 4.52
## 4 62.8 G SI1 8599 1.43 4.49
## 5 62.2 J SI1 8610 1.65 4.7
## 6 62.7 D SI2 8631 1.52 4.59
## 7 60.8 G SI2 8637 1.51 4.51
## 8 61.9 I SI2 8637 1.53 4.56
## 9 62 G SI2 8643 1.57 4.62
## 10 62.1 I SI1 8685 1.5 4.56
## # … with 542 more rows
This was executed on a small dataset of 54k observations and 10 variables. We will see it on a bigger one in the biological example vignette.
text and keyword querying
Lastly, we need to see the difference between a keyword and a text field.
Elasticsearch can index text values as two different types: text and keyword. The difference between those two is that:
text columns such as “name” or “skin_color” are broken up into words during indexing, allowing searches on one or more words,
#> search every document which has at least
#> one word in the "name" column starting with "L"
kc$pull("starwars",
query = "name:L*",
columns = "name")$starwars
## # A tibble: 10 x 1
## name
## <chr>
## 1 Luke Skywalker
## 2 Leia Organa
## 3 Owen Lars
## 4 Beru Whitesun lars
## 5 Lando Calrissian
## 6 Lobot
## 7 Cliegg Lars
## 8 Poggle the Lesser
## 9 Luminara Unduli
## 10 Lama Su
keyword columns (always added when pushing data with kibior) keep the full text as one string.
#> search every document whose "name"
#> field starts with "L"
kc$pull("starwars",
query = "name.keyword:L*",
columns = "name")$starwars
## # A tibble: 6 x 1
## name
## <chr>
## 1 Luke Skywalker
## 2 Leia Organa
## 3 Lando Calrissian
## 4 Lobot
## 5 Luminara Unduli
## 6 Lama Su
kibior indexes all text values as text AND keyword, so we can use whole-text search (with the .keyword tag) AND word-specific search (without the .keyword tag).
Doing a search for a word starting with a specific prefix in pure R is a bit more annoying:
dplyr::starwars[["name"]] %>% #> take the name column data
lapply(function(x){ #> for each name
stringr::str_split(x, " ") %>% #> split name by space
unlist(use.names = FALSE) %>% #> align
grepl("^L", ., ignore.case = TRUE) %>% #> search pattern for words starting with "L", ignore case to search also for "^l"
any() #> TRUE if at least one word match
}) %>% #> list of logicals
unlist(use.names = FALSE) %>% #> flatten it to logical vector to match starwars observations number
dplyr::starwars[.,] %>% #> apply logical filter only on lines that were found
dplyr::select(name) #> select only "name" var
## # A tibble: 10 x 1
## name
## <chr>
## 1 Luke Skywalker
## 2 Leia Organa
## 3 Owen Lars
## 4 Beru Whitesun lars
## 5 Lando Calrissian
## 6 Lobot
## 7 Cliegg Lars
## 8 Poggle the Lesser
## 9 Luminara Unduli
## 10 Lama Su
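For comparison, the whole-string behavior of a keyword field is simpler to mimic locally. The sketch below contrasts both behaviors in base R, using a small hand-picked subset of names as stand-in data (an assumption made to keep the example self-contained).

```r
# Stand-in data: a few character names taken from the outputs above
name <- c("Luke Skywalker", "Owen Lars", "Beru Whitesun lars", "Lobot")

# keyword-like match: the whole string must start with "L"
keyword_like <- name[startsWith(name, "L")]

# text-like match: at least one word starts with "L" (case-insensitive)
words <- strsplit(name, " ", fixed = TRUE)
has_l_word <- vapply(words,
                     function(w) any(grepl("^l", w, ignore.case = TRUE)),
                     logical(1))
text_like <- name[has_l_word]

keyword_like
text_like
```

The keyword-like filter keeps only names whose first character is "L", while the text-like filter also keeps names such as "Owen Lars" where a later word matches.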
Elasticsearch has some reserved characters: + - = && || > < ! ( ) { } [ ] ^ " ~ * ? : \ /
You should remove them from your data before pushing it into Elasticsearch. If that is not possible, or if you want to retrieve someone else's data that contains reserved characters, you should try to query with a keyword field.
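When a reserved character has to appear literally in a query, the Elasticsearch query string documentation describes escaping it with a leading backslash, and notes that "<" and ">" cannot be escaped and must be removed. A minimal sketch of that idea follows. `escape_es_query()` is a hypothetical helper written for this illustration, not a kibior function, and it has not been checked against every Elasticsearch version.

```r
# Hypothetical helper (not part of kibior): prepare a literal value
# for use inside an Elasticsearch query string.
escape_es_query <- function(x) {
  # "<" and ">" cannot be escaped, so they are dropped
  x <- gsub("[<>]", "", x)
  # prefix every other reserved character with a backslash
  gsub("([-+=&|!(){}\\[\\]^\"~*?:\\\\/])", "\\\\\\1", x, perl = TRUE)
}

escaped_formula <- escape_es_query("(1+1)=2")
escaped_name    <- escape_es_query("R2-D2")

# cat() shows the strings as Elasticsearch would receive them
cat(escaped_formula, escaped_name, sep = "\n")
```

The escaped value is meant to be pasted into a larger query, for instance after a field name such as `name:`.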
$push() details
When pushing data with default parameters, kibior will define unique IDs for each record (each line of a table) and add them as metadata. You can retrieve them by using $pull(keep_metadata = TRUE).
#> With the storms index
kc$pull("storms", keep_metadata = TRUE)$storms
## # A tibble: 10,010 x 21
## `_index` `_type` `_id` `_version` `_seq_no` `_primary_term` found `_source.name` `_source.year` `_source.month`
## <chr> <chr> <chr> <int> <int> <int> <lgl> <chr> <int> <int>
## 1 storms _doc 10001 1 10000 1 TRUE Kate 2015 11
## 2 storms _doc 10002 1 10001 1 TRUE Kate 2015 11
## 3 storms _doc 10003 1 10002 1 TRUE Kate 2015 11
## 4 storms _doc 10004 1 10003 1 TRUE Kate 2015 11
## 5 storms _doc 10005 1 10004 1 TRUE Kate 2015 11
## 6 storms _doc 10006 1 10005 1 TRUE Kate 2015 11
## 7 storms _doc 10007 1 10006 1 TRUE Kate 2015 11
## 8 storms _doc 10008 1 10007 1 TRUE Kate 2015 11
## 9 storms _doc 10009 1 10008 1 TRUE Kate 2015 11
## 10 storms _doc 10010 1 10009 1 TRUE Kate 2015 11
## # … with 10,000 more rows, and 11 more variables: `_source.day` <int>, `_source.hour` <int>, `_source.lat` <dbl>,
## # `_source.long` <dbl>, `_source.status` <chr>, `_source.category` <chr>, `_source.wind` <int>,
## # `_source.pressure` <int>, `_source.ts_diameter` <dbl>, `_source.hu_diameter` <dbl>, `_source.kid` <int>
Metadata columns are mainly prefixed by an underscore. The actual record is embedded into the _source field. Since data have been pushed without specifying an ID column, the _id field that defines Elasticsearch unique IDs reflects the one automatically added by kibior in the data (kid by default). To change the default ID column added by kibior, change the $default_id_col attribute value.
Letting kibior handle ID attribution guarantees uniqueness, but the resulting IDs might not be the most meaningful or practical for updates.
To change that behavior, you can define your own ID field when calling $push() by using the id_col parameter.
#> Again, pushing storms, but with our own IDs, for instance,
#> by adding "aaa" at the beginning of each row number and use it as ID.
data <- dplyr::storms
ids <- seq_len(nrow(data)) %>% paste("aaa", ., sep="")
data <- cbind(a_new_unique_id = ids, data)
#> the column "a_new_unique_id" will be used as our unique ID
kc$push(data, "storm_with_our_id", id_col = "a_new_unique_id")
## [1] "storm_with_our_id"
#> and see
s <- kc$pull("storm_with_our_id",
columns = "a_new_unique_id",
keep_metadata = TRUE)$storm_with_our_id
s %>% dplyr::select(c("_id", "_source.a_new_unique_id"))
## # A tibble: 10,010 x 2
## `_id` `_source.a_new_unique_id`
## <chr> <chr>
## 1 aaa8991 aaa8991
## 2 aaa8992 aaa8992
## 3 aaa8993 aaa8993
## 4 aaa8994 aaa8994
## 5 aaa8995 aaa8995
## 6 aaa8996 aaa8996
## 7 aaa8997 aaa8997
## 8 aaa8998 aaa8998
## 9 aaa8999 aaa8999
## 10 aaa9000 aaa9000
## # … with 10,000 more rows
Caution here: the columns
parameter does not apply to metadata.
#> columns match nothing except actual pushed data columns
kc$pull("storms", keep_metadata = TRUE, columns = c("_id", "_version"))$storms
## # A tibble: 10,010 x 7
## `_index` `_type` `_id` `_version` `_seq_no` `_primary_term` found
## <chr> <chr> <chr> <int> <int> <int> <lgl>
## 1 storms _doc 10001 1 10000 1 TRUE
## 2 storms _doc 10002 1 10001 1 TRUE
## 3 storms _doc 10003 1 10002 1 TRUE
## 4 storms _doc 10004 1 10003 1 TRUE
## 5 storms _doc 10005 1 10004 1 TRUE
## 6 storms _doc 10006 1 10005 1 TRUE
## 7 storms _doc 10007 1 10006 1 TRUE
## 8 storms _doc 10008 1 10007 1 TRUE
## 9 storms _doc 10009 1 10008 1 TRUE
## 10 storms _doc 10010 1 10009 1 TRUE
## # … with 10,000 more rows
When pushing data, if the index you are using in $push() already exists, an error will be thrown. This is due to the mode = "check" parameter, which checks whether an index with the name you gave already exists. This is the default option, but it can be changed to "recreate" or "update":
"recreate" will erase the index and write to a fresh one with the same name. Be cautious with this option as you will erase previously written data from that index name.
#> recreate one index, whether it already exists or not
dplyr::starwars %>% kc$push("starwars", mode = "recreate")
## [1] "starwars"
"update" will push and update indexed data with corresponding IDs. For this option, you must know which field is the unique ID and send updated documents over them. You do not need to send all data again, only the subset that changed. Sending all data again is error-prone and can take a lot of time if your dataset is big. Knowing which field is the unique ID also helps a lot and prevents errors.
#> we will change the height of orange-eyed inhabitants of "Naboo"
#> homeworld to 300 and update that subset to the main one.
s <- kc$pull("starwars", query = "eye_color:orange && homeworld:naboo")$starwars
s
## # A tibble: 3 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <int> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <lis> <chr> <chr>
## 1 Jar … 196 66 none orange orange 52 male mascu… Naboo Gungan <chr… "" ""
## 2 Roos… 224 82 none grey orange NA male mascu… Naboo Gungan <chr… "" ""
## 3 Rugo… 206 NA none green orange NA male mascu… Naboo Gungan <chr… "" ""
## # … with 1 more variable: kid <int>
#> change the height of those selected to 300
s$height <- 300
s
## # A tibble: 3 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <dbl> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <lis> <chr> <chr>
## 1 Jar … 300 66 none orange orange 52 male mascu… Naboo Gungan <chr… "" ""
## 2 Roos… 300 82 none grey orange NA male mascu… Naboo Gungan <chr… "" ""
## 3 Rugo… 300 NA none green orange NA male mascu… Naboo Gungan <chr… "" ""
## # … with 1 more variable: kid <int>
#> and update the main dataset. Since it is a subset of that dataset,
#> IDs are the same, which is default "kid" column.
ns <- kc$push(s, "starwars", mode = "update", id_col = "kid")
#> see the result
ns <- kc$pull("starwars",
query = "eye_color:orange && homeworld:naboo")$starwars
ns
## # A tibble: 3 x 15
## name height mass hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles starships
## <chr> <int> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <lis> <chr> <chr>
## 1 Jar … 300 66 none orange orange 52 male mascu… Naboo Gungan <chr… "" ""
## 2 Roos… 300 82 none grey orange NA male mascu… Naboo Gungan <chr… "" ""
## 3 Rugo… 300 NA none green orange NA male mascu… Naboo Gungan <chr… "" ""
## # … with 1 more variable: kid <int>
dplyr functions
The dplyr package offers simple and effective functions called filter and select to quickly reduce the scope of interest. In the same fashion, kibior uses the Elasticsearch query string syntax, which is very similar to the dplyr syntax (see Querying section). Elasticsearch multiplies the search possibilities by allowing similar usage on multiple indices, or datasets, on multiple remote servers.
Moreover, using $count(), $search() or $pull(), one can use their analogous features:
dplyr::select() with the columns parameter,
dplyr::filter() with the query parameter.
Using both of them results in much more powerful search capabilities with much more readable code.
The following sections give some examples of analogous requests.
Select some columns:
#> dplyr
s <- dplyr::starwars %>%
dplyr::select(name, height, homeworld)
#> kibior
s <- kc$pull("starwars",
columns = c("name", "height", "homeworld"))
Filter on strict thresholds:
#> dplyr
s <- dplyr::starwars %>%
dplyr::filter(height > 180)
#> kibior
s <- kc$pull("starwars",
query = "height:>180")
Filter on soft thresholds:
#> dplyr
s <- dplyr::starwars %>%
dplyr::filter(height >= 180)
#> kibior
s <- kc$pull("starwars",
query = "height:>=180")
#> or with range notation
s <- kc$pull("starwars",
query = "height:[180 TO *]")
Filter on ranges:
#> dplyr
s <- dplyr::starwars %>%
dplyr::filter(height >= 180 & height < 300)
#> kibior
s <- kc$pull("starwars",
query = "height:[180 TO 300}")
Filter on exact string match for one field:
#> dplyr
s <- dplyr::starwars %>%
dplyr::filter(homeworld == "Naboo")
#> kibior
s <- kc$pull("starwars",
query = "homeworld:Naboo")
Filter on exact string match with multiple choices on one field:
#> dplyr
s <- dplyr::starwars %>%
dplyr::filter(homeworld == "Naboo" | homeworld == "Tatooine")
#> or
s <- dplyr::starwars %>%
dplyr::filter(homeworld %in% c("Naboo", "Tatooine"))
#> kibior (several ways to do it)
s <- kc$pull("starwars",
query = "homeworld:(Naboo || Tatooine)")
Filter on partial string matching:
#> dplyr, we have to use `str_detect`
s <- dplyr::starwars %>%
dplyr::filter(stringr::str_detect(name, "Luk|Dar"))
#> kibior, nothing else required
s <- kc$pull("starwars",
query = "name:(*Luk* || *Dar*)")
Filter over a composition of multiple filters (multiple columns):
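A composed filter can be sketched by combining the patterns from the previous examples. The specific conditions below (height above 180 and homeworld "Naboo") are chosen only for illustration. The kibior call assumes a connected instance named `kc` and a pushed "starwars" index, as in the rest of this vignette, and is skipped when no such instance exists.

```r
library(dplyr)

#> dplyr: several conditions on different columns, then a column selection
s_dplyr <- dplyr::starwars %>%
  dplyr::filter(height > 180 & homeworld == "Naboo") %>%
  dplyr::select(name, height, homeworld)

#> kibior: the same conditions expressed as one query string
q <- "height:>180 && homeworld:Naboo"
if (exists("kc")) {
  s_kibior <- kc$pull("starwars",
                      query = q,
                      columns = c("name", "height", "homeworld"))
}
```

Both forms combine the filtering and selecting steps shown separately above, with the conditions joined by `&` in dplyr and by `&&` in the query string.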
Even if there are lots of similarities regarding the syntax, Elasticsearch is a powerful search engine. Thus, requests on billions of records are less expensive to do with it. Also, Elasticsearch is accessible through its API, so numerous people can access it at the same time. This means you can work synchronously with a collaborator pushing data and use them immediately after. Moreover, using wildcards, we can search on multiple indices at once.
What we can do very easily with Elasticsearch is search everywhere: in every index, in every column, and in every word. Lastly, full-text searches are the big deal. See Text and Keyword querying for more details.
kibior returns base types in tibble structures (integer, character, logical, and list) to represent data. If you want to change the type of some columns, use readr::type_convert() after retrieving the dataset.
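As an illustration, here is a minimal sketch of that conversion step. The pulled data frame is a hand-made stand-in for a dataset returned by kc$pull(), assuming its numeric columns came back as character; it is not actual kibior output.

```r
library(readr)

#> stand-in for a pulled dataset: numbers stored as character
pulled <- data.frame(
  name   = c("Luke Skywalker", "Leia Organa"),
  height = c("172", "150"),
  mass   = c("77", "49"),
  stringsAsFactors = FALSE
)

#> re-guess the type of every character column
converted <- readr::type_convert(pulled)
str(converted)
```

readr::type_convert() only inspects character columns, so columns that already have the right type are left untouched.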
If you manage multiple instances, you can easily compare their host:port pairs with the == and != operators.
When using only one instance of kibior, you might want to attach this instance to the global environment. This removes the need for the instance prefix at the beginning of each method call (in our examples: kc$...).
Though it can be practical in local development with only one instance, we strongly discourage this practice if you intend to share your code. It can cause incorrect behavior during execution in environments with different configurations or multiple instances.
kibior integrates the dplyr package joins: full, left, right, inner, anti, and semi joins.
By using kibior joins, you can apply these joins to in-memory datasets and Elasticsearch-based indices. kibior supports query parameters when joining to reduce data retrieval time, but cannot join on list-type columns.
#> pushing a subset of data
dplyr::starwars %>%
dplyr::filter(homeworld == "Naboo") %>%
kc$push("starwars_naboo", mode = "recreate")
kc$pull("starwars_naboo")
#> perform an inner join between the in-memory full dataset
#> and the remote subset we have just sent
columns <- c("name", "height", "mass", "gender", "homeworld")
kc$inner_join(dplyr::starwars, "starwars_naboo",
left_columns = columns,
right_columns = columns,
by = c("name", "height", "mass"))
In the result, kibior applies the suffixes left and right to data columns.
Apart from moving and copying indices within the same cluster of Elasticsearch instances, the $move() and $copy() methods can do the same with REMOTE instances. The remote Elasticsearch endpoint has to be declared inside your elasticsearch.yml configuration file.
By adding one line to the elasticsearch.yml configuration file to declare a server whitelist, Elasticsearch servers can talk to each other. They can then transfer data between them in a much faster and more secure way.
#> config/elasticsearch.yml
...
reindex.remote.whitelist: "otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*"
...
A full description can be found in the Elasticsearch documentation.
After that, kibior will be able to use the from_instance parameter of $move() and $copy().
#> init two ES binding
#> kc_local must be configured
#> we make the assumption that both kc are accessible
kc_local <- kibior$new("es_local")
kc_remote <- kibior$new("es_remote", port = 9205)
#> copy data from kc_remote to kc_local
kc_local$copy(from_index = "remote_index",
to_index = "new_copy_of_remote_index_in_local",
from_instance = kc_remote)
This method allows massive data copying in a much faster way since all data are structured the same.
As with all implementations, there are some limits:
Elasticsearch cannot store uppercase field names; thus, by default, all column names are forced to lowercase when submitted.
Elasticsearch interprets dots in strings as nested values (e.g. “aaa.bbb” is understood as a field “aaa” containing a field “bbb”), which is error-prone with the R language since variables can be named with dots. To avoid errors when pushing data to Elasticsearch, dots in column names are replaced by underscores.
#> iris column names
datasets::iris %>% names()
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
#> example with iris dataset
datasets::iris %>% kc$push("iris")
## [1] "iris"
# get columns of index iris
kc$columns("iris")
## $iris
## [1] "kid" "petal_length" "petal_width" "sepal_length" "sepal_width" "species"
Elasticsearch has a configurable default limit of 1000 columns, so pushing a dataset with more than 1000 variables generates an error. Two solutions: try to transpose it, or define a higher Elasticsearch limit in its configuration.
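The first solution can be sketched in base R as follows. The wide dataset and its sample names are made up for illustration: variables become rows and samples become columns, which brings the column count well under the limit. The second solution corresponds to the Elasticsearch index setting index.mapping.total_fields.limit.

```r
#> hypothetical wide dataset: 3 samples described by 1500 variables
wide <- as.data.frame(matrix(seq_len(3 * 1500), nrow = 3))
rownames(wide) <- c("sample_a", "sample_b", "sample_c")

#> transpose: variables become rows, samples become columns
long <- as.data.frame(t(wide))
long <- cbind(variable = rownames(long), long, stringsAsFactors = FALSE)
rownames(long) <- NULL

dim(wide)  #> 3 rows, 1500 columns: above the default limit
dim(long)  #> 1500 rows, 4 columns: under the default limit
```

Transposing only makes sense when all variables share a compatible type, since each new column then mixes values from every original variable.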
Elasticsearch handles each document (each line of a table) with a unique ID: a specific "_id" metadata field. What can be confusing here is that metadata are not on the same level as data in Elasticsearch. To update data more easily by accurately targeting document IDs, we add a new unique field (kid by default) when pushing data to Elasticsearch and define it as the unique "_id" field. If you know that one of your columns is unique and can be used as an ID column, you can use the id_col parameter of the $push() method to define this column as the main ID.
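Before choosing an ID column, it can help to check that it really is unique. The sketch below uses a small made-up data frame and a hypothetical helper, is_valid_id(), which is not part of kibior; the $push() call is commented out because it needs a connected instance kc.

```r
#> small made-up dataset
people <- data.frame(
  name      = c("Luke Skywalker", "Anakin Skywalker", "Padme Amidala"),
  homeworld = c("Tatooine", "Tatooine", "Naboo"),
  stringsAsFactors = FALSE
)

#> hypothetical helper: a column can serve as document ID only if
#> it has no missing and no duplicated values
is_valid_id <- function(x) !anyNA(x) && anyDuplicated(x) == 0

is_valid_id(people$name)       #> unique, usable as ID
is_valid_id(people$homeworld)  #> duplicated, not usable as ID

#> with a connected kibior instance `kc` (assumed):
# kc$push(people, "people", id_col = "name")
```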
The columns parameter does not handle metadata columns.
Elasticsearch is really good at textual and keyword search, but for that the text needs common delimiters so that it can be split into words. Passing a single, billions-long, uninterrupted biomolecular sequence is not a good fit for Elasticsearch and may result in an indexing failure.
$move() and $copy() for remote instances are very sensitive to authentication and security configurations. Some tasks will not be possible because of each organization's security measures. Check with your system administrator.
Joins are not executed server-side (on ES), which means the Elasticsearch data must be downloaded before executing the actual join. Querying and selecting columns with the join parameters left_columns, right_columns, left_query and right_query is therefore important to lower the data transfer payload and speed up the execution.
Elasticsearch limits returned results to 10,000 elements per bulk. If you try to set bulk_size > 10000 in parameter, kibior will downsize it to match the maximum allowed.
The query parameter is a powerful string-based mechanism. Users need to understand that the query parameter sends a query to an Elasticsearch instance in a single request. If the request is generated from a list of elements, such as c("id1", "id2", "id3", ...) %>% paste0(collapse = " || ") %>% kc$search("*", query = .), it can become a very long string that cannot be passed to Elasticsearch in its entirety. One way to work around this issue is to split the element vector into subsets and make multiple calls. This will be fully automated in future versions.
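The splitting workaround can be sketched as follows. The helper build_chunked_queries() and the chunk size of 100 are assumptions made for illustration, not part of kibior; the $search() calls are commented out because they need a connected instance kc.

```r
#> hypothetical helper: split a vector of IDs into chunks and
#> build one query string per chunk
build_chunked_queries <- function(ids, chunk_size = 100) {
  chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))
  unname(vapply(chunks, paste0, character(1), collapse = " || "))
}

ids <- paste0("id", 1:250)
queries <- build_chunked_queries(ids, chunk_size = 100)
length(queries)  #> 3

#> with a connected kibior instance `kc` (assumed), send each query separately
# results <- lapply(queries, function(q) kc$search("*", query = q))
```

A suitable chunk size depends on the length of the IDs and on the server configuration, so it is best tuned by trial.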
kibior applies some modifications to datasets before sending them to Elasticsearch: it turns all column names to lowercase, replaces dots in column names with underscores, adds the kid column, etc. All these transformations can affect the behavior of the $*_join() methods.
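To anticipate which column names an index will carry after a push, the renaming described above can be approximated locally. The helper normalize_names() is hypothetical and only mimics the documented behavior (lowercase, dots to underscores); it is not kibior's internal code.

```r
#> hypothetical helper approximating the documented renaming
normalize_names <- function(x) {
  tolower(gsub(".", "_", x, fixed = TRUE))
}

normalize_names(names(datasets::iris))
## [1] "sepal_length" "sepal_width"  "petal_length" "petal_width"  "species"
```

Applying the same renaming to an in-memory dataset before a join keeps the by columns aligned with those of the remote index.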
The $keys() method limits the number of unique keys returned to 1000 by default, since it aggregates a potentially unlimited number of keys, which can happen when calling it on integer or floating-point values. If you want more, change the max_size method parameter.
kibior has been tested with these configurations:
Software | Version |
---|---|
Elasticsearch | 6.8, 7.5, 7.8, 7.9, 7.10 |
R | 3.6.1, 4.0.2, 4.0.3 |
RStudio | 1.2.5001, build 93, 7b3fe265; 1.4.1103, build "Wax Begonia", 458706c3 |
This vignette has been built using the following session:
```r
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04 LTS
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kibior_0.1.1 magrittr_2.0.1 readr_1.4.0 stringr_1.4.0 dplyr_1.0.3
## [6] ggplot2_3.3.3 knitr_1.30
##
## loaded via a namespace (and not attached):
## [1] zip_2.1.1 Rcpp_1.0.6 cellranger_1.1.0 pillar_1.4.7
## [5] compiler_4.0.3 forcats_0.5.0 elastic_1.1.0 tools_4.0.3
## [9] digest_0.6.27 jsonlite_1.7.2 evaluate_0.14 lifecycle_0.2.0
## [13] tibble_3.0.5 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.10
## [17] openxlsx_4.2.3 crul_1.0.0 curl_4.3 yaml_2.2.1
## [21] haven_2.3.1 xfun_0.20 rio_0.5.16 withr_2.4.1
## [25] generics_0.1.0 vctrs_0.3.6 hms_1.0.0 grid_4.0.3
## [29] tidyselect_1.1.0 glue_1.4.2 httpcode_0.3.0 data.table_1.13.6
## [33] R6_2.5.0 readxl_1.3.1 foreign_0.8-80 rmarkdown_2.6
## [37] tidyr_1.1.2 purrr_0.3.4 scales_1.1.1 ellipsis_0.3.1
## [41] htmltools_0.5.1.1 colorspace_2.0-0 stringi_1.5.3 munsell_0.5.0
## [45] crayon_1.3.4
```
Chamberlain, Scott. 2020. “Elastic: General Purpose Interface to ‘Elasticsearch’.” Bioinformatics. https://CRAN.R-project.org/package=elastic.
Chan, Chung-hong, Geoffrey CH Chan, Thomas J. Leeper, and Jason Becker. 2018. “Rio: A Swiss-Army Knife for Data File I/O.” https://CRAN.R-project.org/package=rio.
Lawrence, Michael, Robert Gentleman, and Vincent Carey. 2019. “Rtracklayer: An R Package for Interfacing with Genome Browsers.” Bioinformatics 25: 1841–2. https://doi.org/10.1093/bioinformatics/btp328.
Morgan, Martin, Hervé Pagès, Valerie Obenchain, and Nathaniel Hayden. 2020. “Rsamtools: Binary Alignment (BAM), FASTA, Variant Call (BCF), and Tabix File Import.” https://doi.org/10.18129/B9.bioc.Rsamtools.
Pagès, H., P. Aboyoun, R. Gentleman, and S. DebRoy. 2020. “Biostrings: Efficient Manipulation of Biological Strings.” https://doi.org/10.18129/B9.bioc.Biostrings.