Date:
Estimated Time:2 minutes
Solr joins thought Spark
→ Context
SolR is great to search within complexes documents. However it is weak to make complexes joins between documents and lack for expressive languages. Combining both solr and spark let envisage having best of two worlds.
→ Dataset
patient_id | visit_id |
1 | 1 |
1 | 2 |
2 | 1 |
2 | 2 |
→ R1
patient_id | visit_id |
1 | 2 |
2 | 1 |
2 | 2 |
→ R2
patient_id | visit_id |
1 | 2 |
2 | 2 |
→ Use case: Q1 AND Q2
Let's say you have defined two queries Q1 and Q2 which have respectively R1 and R2. Now you d'like to know which patients have both kind of visits.
SELECT R1.patient_id, R1.visit_id
FROM R1
JOIN R2 USING (patient_id)
WHERE AND
!= R1.visit_id R2.visit_id
→ Use Case: Q1 OR Q2
SELECT patient_id, visit_id FROM R1)
(UNION
SELECT patient_id, visit_id FROM R2) (
→ Use Case: Q1 AND NOT Q2
SELECT patient_id, visit_id FROM R1)
(MINUS
SELECT patient_id, visit_id FROM R2) (
→ Use Case: Q1 THEN Q2
Say we also have date along with data.
Say you want all Patients who had Q2 >1> Q1. You need to have the dates.
SELECT R1.patient_id, R1.visit_id, R1.date
FROM R1
JOIN R2 USING (patient_id)
WHERE TRUE
AND R2.visit_id != R1.visit_id
AND R2.date > R1.date + 1
→ Operations
Given this formula: (Q1 AND Q2) <2< Q3
WITH
as (SELECT 1 as patient_id, 2 as visit_id, 3 as date
Q1 UNION ALL
SELECT 2,1,2
UNION ALL
SELECT 2,2,4),
as (SELECT 1 as patient_id, 2 as visit_id, 3 as date
Q2 UNION ALL
SELECT 2,2,4),
as (SELECT 1 as patient_id, 2 as visit_id, 5 as date
Q3 UNION ALL
SELECT 2,2,6),
as (SELECT Q1.patient_id
T1 FROM Q1
JOIN Q2 USING(patient_id)
WHERE TRUE
AND Q2.visit_id != Q1.visit_id),
as (SELECT Q1.patient_id
T2 FROM Q1
JOIN Q3 USING(patient_id)
WHERE TRUE
AND Q3.visit_id != Q1.visit_id
AND Q3.date > Q1.date + 2),
as (SELECT Q2.patient_id
T3 FROM Q2
JOIN Q3 USING(patient_id)
WHERE TRUE
AND Q3.visit_id != Q2.visit_id
AND Q3.date > Q2.date + 2)
SELECT * FROM T1)
(INTERSECT
SELECT * FROM T2)
(INTERSECT
SELECT * FROM T2) (
Given this formula: (Q1 OR Q2) AND Q3
WITH
as (SELECT 1 as patient_id, 2 as visit_id, 3 as date
Q1 UNION ALL
SELECT 2,1,2
UNION ALL
SELECT 2,2,4),
as (SELECT 1 as patient_id, 2 as visit_id, 3 as date
Q2 UNION ALL
SELECT 2,2,4),
as (SELECT 1 as patient_id, 2 as visit_id, 5 as date
Q3 UNION ALL
SELECT 2,2,6),
as (SELECT Q1.patient_id
T1 FROM Q1
UNION
SELECT Q2.patient_id
FROM Q2),
as (SELECT Q3.patient_id
T2 FROM Q3)
SELECT * FROM T1)
(INTERSECT
SELECT * FROM T2) (
Given this formula: Q1 AND Q2 AND Q3
WITH
as (SELECT 1 as patient_id, 1 as visit_id),
Q1 as (SELECT 1 as patient_id, 2 as visit_id),
Q2 as (SELECT 1 as patient_id, 3 as visit_id),
Q3 as (SELECT Q1.patient_id
T1 FROM Q1
JOIN Q2 USING(patient_id)
WHERE TRUE
AND Q2.visit_id != Q1.visit_id),
as (SELECT Q2.patient_id
T2 FROM Q2
JOIN Q3 USING(patient_id)
WHERE TRUE
AND Q3.visit_id != Q2.visit_id),
as (SELECT Q1.patient_id
T3 FROM Q1
JOIN Q3 USING(patient_id)
WHERE TRUE
AND Q3.visit_id != Q1.visit_id)
SELECT * FROM T1)
(INTERSECT
SELECT * FROM T2)
(INTERSECT
SELECT * FROM T3) (
Given this formula: Q1 <1< Q2 <1< Q3
WITH
as (SELECT 1 as patient_id, 1 as visit_id, 3 as date),
Q1 as (SELECT 1 as patient_id, 2 as visit_id, 4 as date),
Q2 as (SELECT 1 as patient_id, 3 as visit_id, 5 as date),
Q3 as (SELECT Q1.patient_id
T1 FROM Q1
JOIN Q2 USING(patient_id)
WHERE TRUE
AND Q2.visit_id != Q1.visit_id
AND Q2.date >= Q1.date + 1),
as (SELECT Q2.patient_id
T2 FROM Q2
JOIN Q3 USING(patient_id)
WHERE TRUE
AND Q3.visit_id != Q2.visit_id
AND Q3.date >= Q2.date + 1)
SELECT * FROM T1)
(INTERSECT
SELECT * FROM T2) (
Given this formula: Q1 <1< Q2 AND NOT Q3
WITH
as (SELECT 1 as patient_id, 1 as visit_id, 3 as date),
Q1 as (SELECT 1 as patient_id, 2 as visit_id, 4 as date),
Q2 as (SELECT 1 as patient_id, 3 as visit_id, 5 as date),
Q3 as (SELECT Q1.patient_id
T1 FROM Q1
JOIN Q2 USING(patient_id)
WHERE TRUE
AND Q2.visit_id != Q1.visit_id
AND Q2.date >= Q1.date + 1)
SELECT * FROM T1)
(MINUS
SELECT * FROM Q3) (
→ Isolating matching records
For each Qn it would be interesting to get the matching visit. This allows to run the solr queries afterwards.