Find one 32-bit integer that is not in a file that contains 32-bit integers in random order.

Asked 2 years ago, Updated 2 years ago, 59 views

Q. Suppose you have a file that contains up to 4 billion 32-bit integers in random order.Find one 32-bit integer that is not in this file.
From column 2.1.A, "Algorithms and data structures that understand the essence of the gem programming goo"

That's the question.

The solution in the book is

Divide the 4 billion integers into 0 and 1 with the first bit, and write them into two files.Each file should contain two billion integers.Then, the smaller integer in the output file is the new input file, and now look at the second bit and divide it into two files.And if you repeat it one after another, it will be output.

That's what it is.

This method cannot be coded in Python.I couldn't find anyone doing it this way, perhaps because of the bad way of looking it up.

Could you please provide me with a method or answer to code it in Python?

python algorithm

2022-09-30 10:19

4 Answers

If you use this algorithm, you must determine whether the ith bit of the 32-bit integer n is 0 or 1.In this algorithm, the ith bit can be counted from the higher bit or from the lower bit.Let's start with the lower bits.Also, for clarity of the program, we will count the first bit as the 0th bit unlike the commentary.

To determine whether the ith bit of n is 0 or 1, you can easily use bit operations: iOnly the first bit is 1 and the second bit is 0 and for each bit.

Consider using an example: n=42, i=3.

n=0b_0010_1010
           mask = 0b_0000_1000
       n&mask=0b_0000_1000
(n&mask)>>i=0b_0000_0001

In this way, you can determine whether the i bit is 0 or 1 by looking at the (n&mask)>i calculation results.

If I were to write down the details, this algorithm only looks at bit patterns, so you don't have to worry about the difference between Big Endian/Little Endian and the difference between 1 and 2 complements in signed integers.All you need to do is guarantee that even one bit is different and it is different as an integer.For example, for floating-point numbers, +0 and -0 have different but equal highest-order bits, so you cannot use this algorithm as it is.

If I were to write down the details, this algorithm only looks at bit patterns, so you don't have to worry about the difference between Big Endian/Little Endian and the difference between 1 and 2 complements in signed integers.All you need to do is guarantee that even one bit is different and it is different as an integer.For example, for floating-point numbers, +0 and -0 have different but equal highest-order bits, so you cannot use this algorithm as it is.

If you want to allow 4 billion random integers to be duplicated, simply add them to the file while generating 4 billion random integers. Python uses random.randint(a,b) to create random integers.To add to the file, the 'a' is a good option for open().

Below is a sample program.

In fact, creating 4 billion lines of files takes time, and the file size is huge (about 30GB?), so it's better to try a smaller number.


2022-09-30 10:19

I don't know where to read the first bit of an integer.

Use the built-in functions bin or format.
Embedded Functions—Python 3.7.3 Documentation

>>format (2**32-1, '032b')
'11111111111111111111111111111111'

I don't know how to create 4 billion 32-bit integer input files arranged randomly between 0 and 1 and how to use the separated file as a new input file and look at the second bit.Is it like making a large list?

Creating a file literally means creating a file, which means you don't get through memory when you create a list of 4 billion elements, so the goal is to eventually get through memory by loading files one line at a time.

Python creates and writes files as follows:

file_name='0.txt'
file_path='/tmp/{}.format(file_name)
with open(file_path, 'w') asf:# file open
    f.write('some_string')#Write

You should be able to implement them by combining them.


2022-09-30 10:19

According to the book method, 32 bit integers are 2^32 = 4,294,967,296 and the maximum number of integers is 4*10^9 = 4,000,000,000 so I have to make 4 billion worst files, so I honestly don't recommend it.For example, if you read a file with 0, 1, 2, ..., 3,999, 999, 999 and 32-bit integers, you will generate a lot of files.If you do that, the file system itself will be flat.

I think this algorithm is targeted at C language, but Python is an integer type that will eventually puncture memory if you read files 4 bytes at a time.To avoid this, it is recommended to have four 1-byte leads.I think there is a way to use numpy.

The 32-bit integer in the file may be represented by a single line of text, but this book probably refers to binary data, and since most PCs use Little Endian, the highest bit is bit 7 of the fourth byte of the 4-byte data (1 start).

It is recommended that you write a code to read/write binary data as a trial.For example, a code that writes one byte of data to a file, 37 ('%') looks like this.The same goes for the lead, but you should look it up with the search engine anyway.

import sys
import structure

out=open("out1.txt", "w")
out.write(struct.pack('B',37))
out.close()

Use a logical product operator (&) and so on to determine if the most significant bit is 0 or 1.

Anyway, if you actually want to implement the method described in the book, you have to start by deciding the name of the output file for determination.You will need to repeat this process 32 steps anyway, but to write this routine you will need to understand the concept of tree and the algorithm called depth-first search, or recursion.

Maybe the algorithm in the book is a joke for "Knowing People."It's worth considering just whether to write it in a file or put it in memory.I don't want to write such a program not only in Python but also in C language.

I'm sorry for the negative response.

There were many errors in my reply, so I will correct them.First of all, Python handles numerical types in an immutable way, so if you handle a large number of numbers, the memory will be punctured, which is actually not the case.

for i in range (0,1000000000):
    print(str(i))

This program works correctly on Python 3 systems.Each numeric data area will be released correctly when it is no longer needed, so there will be no memory exhaustion.

Next, we present a binary write program in C language because we find it very difficult to debug the program described in the title.A program that lights from 0x00000000 to 0xFFFFFFFF.

#include<stdio.h>

int main() {
    FILE* out;
    unsigned inti;
    unsigned intj;
    unsigned intend_j;
    unsigned int buf [0x400];

    out = fopen("out3.txt", "wb");
    i = 0;
    do{
        for(j=0;j<0x400;++j)
            buf[j] = i+j;
        fwrite(&buf,sizeof(unsigned int),0x400,out);
        i+=0x400;
    } while(i);
    fclose(out);

    return 0;
}

This program generates exactly 16GB of files, so run on storage with a large amount of disk space.I accidentally ran this on the C drive (SSD) and made Windows unstable.If you look at the binary data you created, you can see that the data was created correctly.

$od-N 100-t x1 out 3.txt
0000000 00 00 00 00 01 00 00 00 02 00 00 00 03 00 00 00
0000020 04 00 00 00 05 00 00 00 06 00 00 00 07 00 00 00
000004 0800000000000000000000a 00000b 000000000000
00000600c 0000d 0000e 0000f 00000000
0000100 10 00 00 00 11 00 00 00 12 00 00 00 13 00 00 00
0000120 14 00 00 00 15 00 00 00 16 00 00 00 17 00 00 00
0000140 18 00 00 00
0000144
$ od-j 17179869084-N 100-t x1 out 3.txt
177777777634e7ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
177777777654ebffffffecffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
1777777767674efffffffff0ffffffff1ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
177777777714 f3ffffffff4ffffff5ffffff6ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
177777777734 f7ffffff8ffffffff9ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
17777777754 fbffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
177777777774fffffffffffffffffffffffffffffffffffffffffff
200000000000

The running time is 2-3 minutes.By the way, I ran it on Ubuntu on WSL.

$time./write3

real2m22.821s
user0m16.391s
sys2m6.141s


2022-09-30 10:19

I'm sorry, but I'm sorry after giving you a long answer, but I'm going to post something that seems to be an answer program.I haven't tested much, so there might be a bug.Thank you for your cooperation.

import structure

def filter(infile, outfile0, outfile1, bit_index):
    in1 = open(infile, "rb")
    out0 = open(outfile0, "wb")
    out1 = open(outfile1, "wb")
    cnt0 = 0
    cnt1 = 0
    mask=1<<bit_index
    data=in1.read(4)
    whilelen(data) == 4:
        n, = structure.unpack('I', data)
        if n&mask == 0:
            out0.write(data)
            cnt0+=1
        else:
            out1.write(data)
            cnt1+=1
        data=in1.read(4)
    in1.close()
    out0.close()
    out1.close()
    return cnt0,cnt1

cnt0, cnt1 = filter("in1.txt", "tmp0", "tmp1", 31)
if cnt0<=cnt1:
    state = 0
    pred = 0
else:
    state = 1
    pred = 1

for i in range (30, -1, -1):
    # print("i:"+str(i)+", pred:"+str(pred)+", state:"+str(state))
    if state == 0:
        cnt0, cnt1 = filter("tmp0", "tmp1", "tmp2", i)
        if cnt0<=cnt1:state=1
        else —state=2
    elif state==1:
        cnt0, cnt1 = filter("tmp1", "tmp2", "tmp0", i)
        if cnt0<=cnt1:state=2
        else —state = 0
    else:
        cnt0, cnt1 = filter("tmp2", "tmp0", "tmp1", i)
        if cnt0<=cnt1:state=0
        else —state=1
    pred<<=1
    if cnt0>cnt1:
        pred+=1

print(pred)


2022-09-30 10:19

If you have any answers or tips


© 2024 OneMinuteCode. All rights reserved.