A logic that uses SQL to link the IDs of the same users of multiple systems and shake them.

Asked 2 years ago, Updated 2 years ago, 405 views

For each unique user ID in multiple systems, we would like to use a unique ID to link the same user ID.
I'm having a hard time deciding what logic to use in SQL, so could you give me some advice?

There is a table that associates the same user.Based on this, I would like to use a unique ID to distinguish between the same users.
It is difficult to pass the same ID to users who span more than three systems because they are associated with the same users on multiple systems.

  • The operation is performed in SQL (because of BigQuery, there is no problem with some heavy operation).Also, it's almost the same as standard SQL, so you can think of it as standard SQL.)
  • Because SQL itself runs in a separate program, multiple SQLs can be run in order.However, due to the large number of data, it is difficult to get the data from the program side through SELECT and process it.(If it's a list of system IDs, it's possible to get them because the number is small.)
  • Since there are nearly 100 target systems, it is difficult because it takes about 5,000 loops to repeat, "First, extract and link only the data for the two times.
     (If you really need a loop, at least the number of systems = 100 times.)
  • On two systems, users are tied one-to-one.(Some users don't get tied up, but not one-to-many)
  • The values in the column [System ID1] and [System ID2] are shown as [System ID1]
  • Due to the nature of BigQuery, UPDATE, INSERT, and DELETE statements are restricted and should not be used
  • You can create an intermediate table
  • You can also use the window function

Original Table Columns

 [System ID1, User ID1, System ID2, User ID2]

Example of the value of the original table *There are tens of millions of lines
↓ For example, A005=B021=C101, so I want to use the same unique ID for these three

sys01,A001,sys02,B002
sys01, A003, sys02, B011
sys01, A005, sys02, B021
sys02, B001, sys03, C050
sys02, B021, sys03, C101
sys01, A001, sys03, C102
sys01, A003, sys03, C111

Table you want to create

 [System ID, User ID, Unique ID]
sys01,A001,0001
sys01,A003,0002
sys01,A005,0003
sys02,B002,0001
sys02,B001,0004 
sys02,B011,0002
sys02,B021,0005
sys03, C101,0003
・・・

I look forward to your kind cooperation.

sql google-bigquery

2022-09-30 21:56

1 Answers

If there is no many-to-one support for user IDs between systems, and the number of systems itself does not increase or decrease dynamically, why don't we create a table with IDs for each system in the same line?

In other words,

is not a table like this.
 src, ID1, dst, ID2
------------------------
sys01, A001, sys02, B002
sys01, A003, sys02, B011
sys01, A005, sys02, B021
sys02, B001, sys03, C050
sys02, B021, sys03, C101
sys01, A001, sys03, C102
sys01, A003, sys03, C111

Wouldn't it be possible to solve this problem if I had the following data?The leftmost column is the primary key and the rest is nullable.

PK, sys01, sys02, sys03
-----------------------
1,A001,B002,C102
2,A003,B011,C111
3,A005,B021,C101
4, null, B001, C050

The current original table looks like an enumeration of the sides of an ID-enabled graph, so it seems that calculation costs a lot to find the connection components of the graph, which is the information you really want to know.By changing the way the data is held as above, wouldn't it be possible to solve this problem if the connection components were kept as a row?

If there are only tens of millions of rows in the original table, I think the calculation of bringing all the data from the table to the local environment, calculating it locally in the language of your choice, and inserting it into the new table will not take that long.Even if you pass the data as a string with the above notation, the original table is about 20 bytes per line, which means that even 10 million lines are about 200 million bytes 200 200 MB.


2022-09-30 21:56

If you have any answers or tips


© 2024 OneMinuteCode. All rights reserved.