#metabase #superset #nginx #postgres #grafana #neo4j
shiny proxy
- using oidc information, an expression can be used to share a volume between users within the same team
- keycloak integration
- sharing containers among users
- custom html
- can set the container user via --docker-user
- docker images repository
- container proxy repos
proxy.allow-transfer-app
: transfer the app to another user? also present in the modal form and in the tests
track-app-url
: show the full url for a given app
always-show-switch-instance
: show the modal for a given app
Debug mode:
logging:
  requestdump: true
  level:
    root: TRACE
→ sp server client call
we might fetch the user groups from the client and generate a form on the fly. we need a userInfo endpoint. Or we can enrich the prepared map here.
- it is possible to generate a form and call the server api with js
- this is the admin endpoint to fetch data for the admin panel: the init calls getAdminData() and returns a datatable, then the admin template calls the init
- for the app start
→ Superset
Overall superset does not support a base url (path prefix), so it's a pain to integrate with SP
→ Metabase
- supports a base url
- has CSP protection, so it does not work in an iframe. The enterprise version can bypass the limitation, likely for a feature to embed dashboards in other applications.
- some buttons are still missing (admin/settings), so there is some work to figure out why -> it was because of iframe detection, which switches metabase to embedded mode.
- there is no anonymous access, so the user would have to log in twice. Too bad. -> use an api connection with lua (see below)
- pre-configuration (creation of users, connections) might be done on the h2 database with a python client?
→ disable CSP
- It is very easy to build a custom metabase and remove that security
- Leverage an nginx reverse proxy to hide the CSP headers (see the sketch after this list)
The second option looks better:
- no custom metabase (MT) build needed and no patch to maintain
- general solution, reusable for integrating other tools
- reuse the nginx to disable the login form, see below
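A minimal sketch of the header-stripping location; the directives are standard nginx, the upstream address is an assumption:
location / {
    proxy_pass http://127.0.0.1:3000;
    # drop the headers that prevent embedding metabase in an iframe
    proxy_hide_header Content-Security-Policy;
    proxy_hide_header X-Frame-Options;
}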
→ skip login
One idea is to call the metabase login api and create a cookie, transferred by nginx.
- the h2 database would be pre-initialized with an admin/admin user
- the entrypoint would copy the db into the user-mounted folder if it does not exist, before starting MT (see the sketch below)
- that user/pass would be used by the script for the api call
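A rough entrypoint sketch, assuming the seed h2 file ships in the image under /opt/seed and the user volume is mounted at /home/user/metabase (the paths are illustrative; MB_DB_FILE is the metabase variable pointing at the h2 file):
#!/bin/sh
# copy the pre-initialized h2 db only on first start
if [ ! -f /home/user/metabase/metabase.db.mv.db ]; then
  cp /opt/seed/metabase.db.mv.db /home/user/metabase/
fi
# point metabase at the user copy and start it
export MB_DB_FILE=/home/user/metabase/metabase.db
exec java -jar /app/metabase.jar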
OpenResty is an nginx distribution which includes the LuaJIT interpreter for Lua scripts
FROM openresty/openresty:buster-fat
RUN opm install ledgetech/lua-resty-http thapakazi/lua-resty-cookie
COPY default.conf /etc/nginx/conf.d/
COPY *.lua /usr/local/openresty/nginx/
COPY nginx.conf /usr/local/openresty/nginx/conf/nginx.conf
server {
    listen 8080;
    server_name your.metabase.domain;
    location / {
        access_by_lua_file gen_token.lua;
        proxy_pass http://127.0.0.1:3000;
    }
}
local cjson = require("cjson")
local httpc = require("resty.http").new()
local ck = require("resty.cookie")

local cookie, err = ck:new()
if not cookie then
    ngx.log(ngx.ERR, err)
    return
end

-- if the browser already has a metabase session, do nothing
local field, err = cookie:get("metabase.SESSION")
if not field then
    -- log in against the metabase api with the pre-provisioned credentials
    local res, err = httpc:request_uri("http://127.0.0.1:3000/api/session", {
        method = "POST",
        body = cjson.encode({
            username = os.getenv("METABASE_USERNAME"),
            password = os.getenv("METABASE_PASSWORD"),
        }),
        headers = {
            ["Content-Type"] = "application/json",
        },
    })
    if not res then
        ngx.log(ngx.ERR, "request failed: ", err)
        return
    end
    local data = cjson.decode(res.body)
    -- send the session id back to the browser as a cookie
    local ok, err = cookie:set({
        key = "metabase.SESSION",
        value = data["id"],
        path = "/",
        domain = ngx.var.host,
        httponly = true,
        -- max_age = 1209600,
        samesite = "Lax",
    })
    if not ok then
        ngx.log(ngx.ERR, err)
        return
    end
    -- also attach it to the current proxied request so the first hit is already authenticated
    ngx.req.set_header("Cookie", "metabase.SESSION=" .. data["id"])
end
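Note that nginx strips environment variables from its workers unless they are declared, so for os.getenv to see the credentials, nginx.conf needs at the top level:
env METABASE_USERNAME;
env METABASE_PASSWORD;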
→ enable concurrent connections
Sounds like we could run multiple instances of MT having the same db. For example sharing the db in the team folder, so that team members share their dashboards.
- h2 likely supports concurrent connections with
h2:file:./data/testdb;AUTO_SERVER=TRUE
- previously metabase was auto_server
→ resources management
→ volume access
Goal:
- In the user directory, files and folders are rw across applications
- In the team directory, files and folders are rw across applications and members of the team
- When a volume is mounted in a container, if the folder does not yet exist, it is created as root and overrides the folder bound in the container. However, if the folder already exists, at least in the docker image or on the host, then it keeps the ownership of its source, hence the explicit chown within the dockerfile -> this is not true with recent docker versions
- the uid and primary gid of the container user are used to create files and folders, unless setuid/setgid are set, in which case the owner/group is kept
- we could use the sticky bit on other to let the container user change the folder user/group at init time (same behavior as the /usr/bin/passwd command)
- ideally all uid/gid should be the same across containers, but it is not possible (rstudio might use 101 while jupyter 102 and so on). While we can set the user from outside, the app might not work with it
- using volume allows to set the uid/gid but only with tmpfs, cifs or NFS; not bind mount
- it is possible to create and configure volumes with a one-liner (see the one-liner after the example below)
- acl on the host won't apply within the container
- if we were able to pre-create the folders on the host it would allow the user to write. Still the apps wouldn't be able to cross edit
- this approach works, however it is not supported by shinyproxy.
FROM ubuntu:22.04
RUN mkdir -p '/foo' ; chown '1001':'1001' '/foo'
# then
docker build -t nico:latest .
docker run -it --rm --user=1001:1001 --mount='source=volumeName,target=/foo,readonly=false' nico:latest ls -alrth /|grep foo
drwxr-xr-x 2 1001 1001 4.0K Sep 10 22:26 foo
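For the uid/gid option mentioned above, a volume one-liner of this kind works for tmpfs-backed volumes (the name and ids are placeholders):
docker volume create --driver local --opt type=tmpfs --opt device=tmpfs --opt o=uid=1001,gid=1001 volumeName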
also we could try to use rootless docker, run by the 1000 user (which is used by jupyter and rstudio)
e.g. to configure an alternative docker url:
proxy.docker.url: URL and port on which to connect to the docker daemon, if not specified ShinyProxy tries to connect using the Unix socket of the Docker daemon
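A minimal application.yml override, assuming the rootless daemon is exposed over TCP on port 2375:
proxy:
  docker:
    url: http://127.0.0.1:2375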
→ LDH folder design
This allows both research projects on HDS infra and courses/misc projects to work with the same design.
Three levels of groups:
- Project: access to a personal folder related to their projects plus a shared folder for all members
- Project-Admin: same as above plus the project-personal folders of every member
- Admin: same as above plus all personal and shared folders
The folder structure can work that way:
- LDH
  - project1-personal
    - user1
    - user2
    - admin-project1
  - project1-shared
  - project2-personal
    - user2
  - project2-shared
Admin would mount:
- LDH:ro
Admin-project1 would mount:
- project1/admin-project1 as project1/personal
- project1-shared as project1/shared
- project1-personal:ro as project1/users
User2 would mount:
- project1/user2 as project1/personal
- project1-shared as project1/shared
- project2/user2 as project2/personal
- project2-shared as project2/shared
Notes:
- If a user has no project, they have no mount point.
- Access to the apps could require a sandbox project
- For admin, the access is read only: it avoids mistakes
The volume expression can work that way:
- #{listToCsv('./data/<repl>/' + userId + ':/root/<repl>/personal', projects)}
- #{listToCsv('./data/<repl>-shared/:/root/<repl>/shared', projects)}
# for admin project
- #{listToCsv('./data/<repl>/:/root/<repl>/users:ro', projects)}
# for admin
- #{listToCsv('./data/:/root/projects:ro', projects)}
→ grafana
We can provide a grafana instance per category: server metrics, logs, postgres metrics... and set a home dashboard in anonymous mode.
depending on the user role, they would have access to more or less container info.
→ access
User access
- server resources
- user container resource usage
- user container logs
Project admin access
- server resources
- project containers
- project containers logs
Admin access
- Server resources
- all containers
- all containers logs
→ logs search
→ postgres metrics
- grafana has a postgres connector so we could query pg at runtime to show metrics such as db size, tables and indexes, queries per time range, etc. (see the queries after this list)
- there is also the postgres metrics exporter to show the server load
- citusdb also supports pg_stat_statements + citus_stat_statements
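A few catalog queries of the kind such a dashboard could run (standard postgres functions; pg_stat_statements requires the extension to be enabled):
-- current database size
SELECT pg_size_pretty(pg_database_size(current_database()));
-- biggest tables, indexes included
SELECT relname, pg_size_pretty(pg_total_relation_size(oid))
FROM pg_class WHERE relkind = 'r'
ORDER BY pg_total_relation_size(oid) DESC LIMIT 10;
-- slowest queries
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC LIMIT 10;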
→ Onlyoffice
→ Databases
→ Neo4j
- the neo4j browser
- neo4j docker image
- neo4j docker compose
- tutorial cypher
- open dataset
- paradise paper dataset
- demo instance
→ Postgres
→ extensions
- citus is a great candidate for a data analytic platform, since it can start with a single node and then scale-up easily
- citusdb docker image
- citus tutorial
→ access management
Needs:
- one empty db per project
- the admin sets up a database from ldh
- the user can read and write somewhere else
- tools are preconfigured to access the db: cbeaver, metabase, Jupyter, rstudio...
Proposal:
- one db per project
- the project admin is owner and is rw in the shared schema
- the project user is read only on the shared schema and is the owner of their own schema (see the SQL sketch after this list)
- the credentials for each project are in a .pgpass file in each user home
- cbeaver is setup with each db
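A sketch of the grants this implies, with hypothetical role and schema names (project1_admin, alice); the schema statements run while connected to the project database:
CREATE DATABASE project1;
CREATE ROLE project1_admin LOGIN PASSWORD 'changeme';
CREATE ROLE alice LOGIN PASSWORD 'changeme';
-- shared schema: the admin owns it, members read it
CREATE SCHEMA shared AUTHORIZATION project1_admin;
GRANT USAGE ON SCHEMA shared TO alice;
ALTER DEFAULT PRIVILEGES FOR ROLE project1_admin IN SCHEMA shared
  GRANT SELECT ON TABLES TO alice;
-- each member owns a private schema
CREATE SCHEMA alice AUTHORIZATION alice;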
How to maintain the db:
- users/groups are not known before they connect: dbs and users cannot be prepopulated
- admin creds cannot be shared within the containers to create the resources
- a dedicated docker service can listen for ldh containers and infer the user/project/role from the container name. Then it can create, if they do not exist, the dbs/users/credentials and put the latter in pgpass/db connections
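For reference, a .pgpass entry is one line per connection, hostname:port:database:username:password, e.g. (placeholder values):
postgres:5432:project1:alice:secret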
→ mongodb
→ rocksdb
→ tantivy
→ neo4j
→ Jupyter
- There are official docker images
- they use conda and install way too much crap
- some stuff is interesting, such as the healthcheck
- kaggle docker img
→ Custom image
- Rootless, so the user is root inside the container and can install whatever they need (including .deb packages)
- we should only provide the python kernel, with the latest stable version (3.12)
- mainstream libraries will be preinstalled
- only the jupyterlab ui will be available, for the sake of simplicity and feature documentation
- we can document how to install a new python version and create a new kernel (see the sketch after this list)
- pyenv will allow installing any version
- virtualenv to allow custom kernels
- overall, libraries will be installed within the virtualenv located in the user home folder, bound to the host. This means they are shared by project
- example dockerfile pyenv
- bash shell terminal
- for pdf support
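A sketch of the documented flow, assuming pyenv and pyenv-virtualenv are installed in the image (the version and kernel name are examples):
pyenv install 3.11.9
pyenv virtualenv 3.11.9 myproject
# install the kernel spec into the user home so it survives container restarts
~/.pyenv/versions/myproject/bin/pip install ipykernel
~/.pyenv/versions/myproject/bin/python -m ipykernel install --user --name myproject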
→ Accessing postgres
- accessing the postgres instance will be done through the .pgpass file maintained by the pg service, and the jupySQL lib pre-installed
- sqlalchemy likely considers the .pgpass file
- jupySQL provides a general way to store connections
- the INI file way is preferred because it allows listing the existing connections and also choosing between multiple connections easily (see the sketch after this list)
- using the ipython-sql lib
- https://medium.com/analytics-vidhya/postgresql-integration-with-jupyter-notebook-deb97579a38d
- ipython-sql
- ploomber
- jupySQL has replaced ipython-sql
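A sketch of the INI approach as I understand the jupySQL docs (the file location, section name and key names should be double-checked; leaving the password out lets libpq fall back to .pgpass):
; ~/.jupysql/connections.ini
[project1]
drivername = postgresql
host = postgres
port = 5432
database = project1
username = alice
then in a notebook: %load_ext sql followed by %sql --section project1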
→ Extensions
→ vscode
- shiny proxy example
--disable-getting-started-override
--disable-file-downloads
--disable-telemetry
--disable-update-check
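A hypothetical ShinyProxy spec passing those flags to code-server (image, port and auth settings are assumptions):
- id: vscode
  display-name: VS Code
  container-image: codercom/code-server:latest
  container-cmd: ["--bind-addr", "0.0.0.0:8080", "--auth", "none",
                  "--disable-telemetry", "--disable-update-check",
                  "--disable-file-downloads", "--disable-getting-started-override"]
  port: 8080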
→ rstudio
→ databases
So it is possible to have predefined connections, that can be navigated from the connection panel.
- write this into
/etc/rstudio/connections/Postgres\ parisni.R
library(connections)
library(RPostgres)
con <- connection_open(
  RPostgres::Postgres(),
  dbname = "postgres",
  host = "postgres",
  port = 5432,
  user = "parisni",
  password = "pwd"
)
→ airflow
- authent
- auth-manager
- docker image
- docker compose
- airflow enable iframe
- flask sp example
- airflow auth oidc within iframe
- allow airflow iframe
- proxy fix
- redirect login
- other auth
- nginx can rewrite location
- security header
- nginx to change referer for client or backend
we have three options:
- run one webserver per user within SP and all the other services in compose. In that case, use remote_user auth + enable iframe
- run all services in compose, and provide a link within SP. In that case, configure airflow with the user's auth (keycloak, ldap...)
- same as 2. but start an nginx in SP to redirect to the unique webserver. In that case, use an identity proxy to log the user in
option 1. consumes more resources since there is a webserver per user, but the auth part is managed by SP. Starting the webserver takes about 1 min
option 2. shares one webserver among all users, but the auth part is way more complicated to set up
option 3. has all the advantages
In all cases we will need to register users within airflow
Ideas:
- try to redirect to an airflow with no base_url. (No /airflow ?) -> it 302s to /home instead of /app_proxy/.../home
- try proxy fix
- apparently the cross-origin problem comes from http vs https. Where does this http come from? -> the Location is http while the Referer is https
- did the proxy w/o DP also have loc/referer broken for https?
proxy_redirect to replace http with https, plus some sub_filters, did fix 99% of the UI (see the sketch below). Still, jquery is broken; it sounds similar to this and might require activating CSP, or removing require-trusted-types-for 'script'; in the nginx config,
but it has an explicit error that the document needs TrustedHTML
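A sketch of that rewriting location (hostnames are placeholders):
location / {
    proxy_pass http://airflow-webserver:8080;
    # rewrite the http Location headers coming back from airflow
    proxy_redirect http:// https://;
    # rewrite absolute http links in the returned pages; disable compression so sub_filter can act
    proxy_set_header Accept-Encoding "";
    sub_filter_once off;
    sub_filter 'http://your.host' 'https://your.host';
}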
Ideas:
- read on the js error stack
- disable csp of SP
- try in chromium
- disable csp on nginx as we did in metabase
- see what's going on [with airflow nonce](https://stackoverflow.com/a/42924000/3865083)
- disable csrf