For each unique user ID in multiple systems, we would like to use a unique ID to link the same user ID.
I'm having a hard time deciding what logic to use in SQL, so could you give me some advice?
There is a table that associates the same user.Based on this, I would like to use a unique ID to distinguish between the same users.
It is difficult to pass the same ID to users who span more than three systems because they are associated with the same users on multiple systems.
Original Table Columns
[System ID1, User ID1, System ID2, User ID2]
Example of the value of the original table *There are tens of millions of lines
↓ For example, A005=B021=C101, so I want to use the same unique ID for these three
sys01,A001,sys02,B002
sys01, A003, sys02, B011
sys01, A005, sys02, B021
sys02, B001, sys03, C050
sys02, B021, sys03, C101
sys01, A001, sys03, C102
sys01, A003, sys03, C111
Table you want to create
[System ID, User ID, Unique ID]
sys01,A001,0001
sys01,A003,0002
sys01,A005,0003
sys02,B002,0001
sys02,B001,0004
sys02,B011,0002
sys02,B021,0005
sys03, C101,0003
・・・
I look forward to your kind cooperation.
sql google-bigquery
If there is no many-to-one support for user IDs between systems, and the number of systems itself does not increase or decrease dynamically, why don't we create a table with IDs for each system in the same line?
In other words,
is not a table like this. src, ID1, dst, ID2
------------------------
sys01, A001, sys02, B002
sys01, A003, sys02, B011
sys01, A005, sys02, B021
sys02, B001, sys03, C050
sys02, B021, sys03, C101
sys01, A001, sys03, C102
sys01, A003, sys03, C111
Wouldn't it be possible to solve this problem if I had the following data?The leftmost column is the primary key and the rest is nullable.
PK, sys01, sys02, sys03
-----------------------
1,A001,B002,C102
2,A003,B011,C111
3,A005,B021,C101
4, null, B001, C050
The current original table looks like an enumeration of the sides of an ID-enabled graph, so it seems that calculation costs a lot to find the connection components of the graph, which is the information you really want to know.By changing the way the data is held as above, wouldn't it be possible to solve this problem if the connection components were kept as a row?
If there are only tens of millions of rows in the original table, I think the calculation of bringing all the data from the table to the local environment, calculating it locally in the language of your choice, and inserting it into the new table will not take that long.Even if you pass the data as a string with the above notation, the original table is about 20 bytes per line, which means that even 10 million lines are about 200 million bytes 200 200 MB.
© 2024 OneMinuteCode. All rights reserved.