I'm trying to match two datasets in Python, but I don't know which method to use.

The data come from the same machine but were collected by two different sensors (I'll call them A and B). Both datasets were recorded at one-second resolution. I want to create a new dataset C by combining some information from A and B, and there is one column that records the same information in both A and B, so I plan to match on that column. In fact, when we plotted that shared column from A and B for the same date, the overall shapes roughly matched.

This is where the problems occur.

A. Because of instrument errors, there are intervals on a given date where data collection is missing on instrument A or B.

B. Because of problem A, the A and B data for the same date differ in length; some readings are recorded slightly earlier and some are pushed back.

C. There are slight errors due to differences between the sensors (e.g. 1: if A records 556, B differs by ±5; e.g. 2: after the device starts operating, sensors A and B do not collect data at the same rate).

D. Problem C also causes a difference in the total length of the datasets.

I would really appreciate it if anyone with similar experience could offer advice or point me to a suitable data matching algorithm.

python data

2023-01-09 15:41

1 Answer

I'm still struggling with a similar problem myself, so I can't speak with much authority. For now, I'm borrowing the idea behind t-SNE and using the A* algorithm to find the most similar pairing between two groups of items.

What this means is:

First, compute a "statistical position" for each item from several indicators.

Then treat the relationship between these statistical positions as the problem of finding the most similar pairing of items between A and B, that is, the pairing with the lowest total distance.

Computing this distance for every pair of items gives a two-dimensional item-by-item matrix.

The task then becomes choosing entries by row and column so that the selected combination has the minimum total.
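To make the formulation above concrete, here is a minimal sketch using only the standard library. The indicators in `stat_position` (mean, standard deviation, min, max), the Euclidean distance, and all function names are my own assumptions for illustration, not the indicators actually used in this answer. The brute-force search tries every pairing, so it assumes equal item counts on both sides and is only usable for a handful of items.

```python
from itertools import permutations
from math import dist
from statistics import mean, pstdev


def stat_position(values):
    """Summarize one item as a point; the choice of indicators is an assumption."""
    return (mean(values), pstdev(values), min(values), max(values))


def distance_matrix(items_a, items_b):
    """Rows are items of A, columns are items of B, entries are Euclidean distances."""
    pos_a = [stat_position(v) for v in items_a]
    pos_b = [stat_position(v) for v in items_b]
    return [[dist(p, q) for q in pos_b] for p in pos_a]


def brute_force_assignment(cost):
    """Try every one-to-one pairing of rows to columns and keep the cheapest."""
    n = len(cost)

    def total(cols):
        return sum(cost[r][cols[r]] for r in range(n))

    best = min(permutations(range(n)), key=total)
    return list(best), total(best)


# Hypothetical toy data: three items per sensor, listed in a different order.
items_a = [[1, 2, 3], [10, 11, 12], [100, 90, 95]]
items_b = [[99, 91, 96], [2, 3, 4], [11, 12, 13]]

cost = distance_matrix(items_a, items_b)
cols, total_distance = brute_force_assignment(cost)
print(cols)  # [1, 2, 0]: A0 pairs with B1, A1 with B2, A2 with B0
```

The number of pairings grows factorially with the number of items, which is why an exhaustive search like this stops being practical very quickly.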

I am solving this with the A* algorithm.

The heuristic is the sum of the minimum values of each remaining row in the residual distance matrix.
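A rough sketch of how A* with that heuristic might look, again using only the standard library. I am assuming a square cost matrix, rows assigned in order from top to bottom, and a "residual" matrix meaning the rows not yet assigned restricted to the columns not yet used; the function names are hypothetical and this is not the actual code from this answer.

```python
import heapq


def remaining_lower_bound(cost, next_row, used_cols):
    """Heuristic: for each unassigned row, add its cheapest still-free column."""
    n = len(cost)
    free = [c for c in range(n) if c not in used_cols]
    return sum(min(cost[r][c] for c in free) for r in range(next_row, n))


def a_star_assignment(cost):
    """Assign rows 0..n-1 in order; a state is the tuple of columns chosen so far."""
    n = len(cost)
    # Heap entries are (f = g + h, g = cost so far, columns chosen so far).
    heap = [(remaining_lower_bound(cost, 0, ()), 0.0, ())]
    while heap:
        _, g, cols = heapq.heappop(heap)
        row = len(cols)
        if row == n:
            # The heuristic never overestimates, so the first complete
            # state taken from the queue has the minimum total cost.
            return list(cols), g
        for c in range(n):
            if c in cols:
                continue
            new_cols = cols + (c,)
            new_g = g + cost[row][c]
            h = remaining_lower_bound(cost, row + 1, new_cols)
            heapq.heappush(heap, (new_g + h, new_g, new_cols))
    return [], float("inf")


cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
print(a_star_assignment(cost))  # ([1, 0, 2], 5.0)
```

Each remaining row has to take some free column, so the sum of row minima is a lower bound on the cost still to come. That is what lets the search stop at the first complete assignment it pops.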

If the number of items exceeds 30, it takes 8 hours to run each time, so I set up one thread that holds the priority queue and serves it over a socket.

The remaining threads were parallelized: each takes a value from the queue, does its calculation, and then puts the results back into the queue through the socket.

After plotting the distribution of the entire distance map, I took the top 30% value as a threshold and pruned anything beyond it without expanding it.
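One possible reading of this thresholding step, as a sketch: I am assuming that "top 30%" means the largest 30% of distances, and that those entries are marked as forbidden so the search never expands a branch through them. If the intended meaning is the opposite (keep only the smallest 30%), pass `keep_percent=30` instead. The names `percentile_threshold`, `prune_by_threshold`, and `keep_percent` are hypothetical.

```python
import math


def percentile_threshold(cost, keep_percent=70):
    """Return the distance at or below which about keep_percent of entries fall."""
    flat = sorted(v for row in cost for v in row)
    index = math.ceil(keep_percent * len(flat) / 100) - 1
    index = min(len(flat) - 1, max(0, index))
    return flat[index]


def prune_by_threshold(cost, keep_percent=70):
    """Replace entries above the threshold with infinity (assumed 'do not expand')."""
    limit = percentile_threshold(cost, keep_percent)
    return [[v if v <= limit else math.inf for v in row] for row in cost]


cost = [[1, 2, 3, 4, 5],
        [6, 7, 8, 9, 10]]
print(prune_by_threshold(cost))
# [[1, 2, 3, 4, 5], [6, 7, inf, inf, inf]]
```

A search using the pruned matrix would skip any pair whose cost is infinite. Note that aggressive pruning can leave some rows with no allowed partner at all, so the threshold is a trade-off between speed and the chance of finding a complete pairing.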

That is about as far as I have been able to squeeze the run time down.

It still has shortcomings and needs more work, but this is how I'm doing it for now.

I hope it helps you with your work.

2023-01-09 21:39
