Error in converting object type data to int type

Asked 2 years ago, Updated 2 years ago, 36 views

I read a CSV file with pandas and some of the columns came in as object type. When I tried to convert them to int, I got stuck on an error like the one in the attachment.
I have tried various approaches without success, so I would appreciate any advice on how to solve this.

I suspected the blank cells were the cause, so I replaced them with np.nan using replace and then converted to int with astype, but that did not work either.

csv files:
https://drive.google.com/file/d/1a0uwjrnOBXi0MIpmpeY6UpyeL4YVYIlL/view?usp=sharing

Code I ran

import numpy as np
import pandas as pd

main = pd.read_csv("sample data.csv")
main.head()

# Some columns are read as strings
print(main['PS_01_B00004031'].dtype)
# object

# Replace blanks with NaN (Not a Number)
main = main.replace(' ', np.nan)
main

# Convert to a numeric type
main = main.astype('int', errors='ignore')
main

main.dtypes.value_counts()
# int32      967
# object     603
# float64     34
# dtype: int64

Result
The object columns are still there.

python pandas

2022-09-30 10:43

1 Answer

As described in the article linked in the comments, if you strip the blank spaces and convert to Int64, every column becomes the nullable integer type Int64 and the blank cells become <NA>.
As @metropolis commented, np.nan in NumPy is a floating-point "Not a Number" value, so a column that contains it cannot be a plain integer type.
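
To make this concrete, here is a minimal sketch using made-up values (not the sample data). It shows that a single NaN turns an otherwise integer column into float64, that casting it to a plain NumPy integer fails, and that the nullable Int64 dtype accepts the missing value.

```python
import numpy as np
import pandas as pd

# Made-up column: two integers and one missing value.
s = pd.Series([1, 2, np.nan])
print(s.dtype)  # NaN forces the whole column to a float dtype

# Casting to a plain NumPy integer fails because NaN has no integer form.
try:
    s.astype("int64")
    cast_failed = False
except (ValueError, TypeError) as exc:
    cast_failed = True
    print("astype('int64') failed:", exc)

# The nullable extension dtype (capital "I") keeps the integers
# and represents the missing value as <NA>.
nullable = s.astype("Int64")
print(nullable)
```

Note the capital "I": 'int64' is the NumPy dtype, while 'Int64' is the pandas nullable extension dtype.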
Nullable integer data type

Changed in version 1.0.0: Now uses pandas.NA as the missing value rather than numpy.nan.

In Working with missing data, we saw that pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some integers cannot even be represented as floating point numbers.
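
The last sentence of the quoted passage can be checked with a small sketch. The identifier 2**53 + 1 is an arbitrary value picked for illustration; it is one past the largest range in which float64 represents every integer exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical identifier that float64 cannot represent exactly.
big_id = 2**53 + 1

# With NaN as the missing-value marker the column becomes float64,
# and the identifier is silently rounded.
as_float = pd.Series([big_id, np.nan])
float_keeps_value = int(as_float.iloc[0]) == big_id

# With the nullable Int64 dtype the identifier survives next to a missing value.
as_nullable = pd.Series([big_id, pd.NA], dtype="Int64")
nullable_keeps_value = int(as_nullable.iloc[0]) == big_id

print(as_float.dtype, float_keeps_value)
print(as_nullable.dtype, nullable_keeps_value)
```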

Characteristics of pd.NA in pandas 1.2.0+
Handle integer types containing missing values in Pandas

You can specify this in a single line when loading the file:

df = pd.read_csv('sample data.csv', skipinitialspace=True, dtype='Int64')
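
For readers without access to the file, the same call can be tried on a small in-memory CSV. The text below is a hypothetical stand-in for 'sample data.csv' (three columns, a space after each comma, one blank cell); only the read_csv arguments come from the line above.

```python
import io

import pandas as pd

# Hypothetical stand-in for "sample data.csv": a space follows every comma
# and one cell contains nothing but that space.
csv_text = (
    "SampleID, AGE, CHILD_AGE_1\n"
    "1160001, 48, \n"
    "1160005, 39, 14\n"
)

# skipinitialspace=True drops the space after each delimiter, so the blank
# cell becomes an empty field; dtype='Int64' then reads it as <NA>.
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True, dtype="Int64")

print(df)
print(df.dtypes)
```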

Here's the result of using the sample data presented in the link.

Python 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_csv('sample data.csv', skipinitialspace=True, dtype='Int64')
>>> df
    SampleID  SEX_CD  AGE  MARRIAGE  CHILD_CD  CHILD_CD2  CHILD_AGE_1  ...  PS_01_B00004165  PI_02_B00004165  PS_02_B00004165  PI_01_B00004166  PS_01_B00004166  PI_02_B00004166  PS_02_B00004166
0    1160001 1 48 1 25 <NA> ... 5 5 5 4 4
1    1160002 151 22 <NA> ... 4 <NA> <NA> 54 <NA> <NA>
2    1160003 255 2125 ... <NA> 45 <NA> 44
3    1160005       1   39         2         1          2           14  ...                4                5                4                5                4                5                4
4    1160006       1   56         2         1          2           26  ...                4                5                4                4                4                5                4
5    1160009       1   59         2         1          2           31  ...                5                5                4                5                4                5                4
6    1160010       2   57         2         1          2           31  ...                5                5                5                5                4                5                4
7    1160014       1   48         2         1          1           22  ...                3                3                3                4                4                3                4
8    1160015       1   53         2         1          3           15  ...                4                5                4                5                4                5                4
9    1163964 1342 <NA> <NA> ... 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
10   1163965 2222 <NA> <NA> ... <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
11   1163966 1 20 1 25 <NA> ... 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
12   1163969       1   49         2         1          1            9  ...                5                5                5                5                5                5                5
13   1163970       2   37         2         1          1            4  ...                5                5                5                5                5                5                5
14   1163972 1392 <NA> <NA> ... 4 3 5 4 4
15   1163974 130 22 <NA> <NA> ... 43 3 3 4 34
16   1163975 15 8 1 25 <NA> ... 45 4 5 45 4
17   1163977 128 12 <NA> <NA> ... 5 3 5 3 4 34
18   1163980 122 12 <NA> <NA> ... 4 <NA> <NA> 34 <NA> <NA> <NA>
19   1163981 2 20 1 25 <NA> ... 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
20   1163983 2 31 1 25 <NA> ... 5 4 5 4 4 4 4

[21 rows x 1604 columns]
>>> df.dtypes.value_counts()
Int64    1604
dtype: int64
>>>
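
If the DataFrame has already been loaded, as in the question, the same idea (strip the blanks, then convert to Int64) can be applied after the fact. The following is only a sketch under assumptions: the frame and the helper name to_nullable_int are hypothetical, and pd.to_numeric with errors='coerce' is used here to turn anything unparseable into a missing value, which may hide genuinely bad data.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical frame mimicking a column that was read as text
# because one cell holds only a space.
main = pd.DataFrame(
    {
        "SampleID": [1160001, 1160005],
        "CHILD_AGE_1": [" ", " 14"],
    }
)


def to_nullable_int(column: pd.Series) -> pd.Series:
    """Return the column as Int64; unparseable cells become <NA>."""
    if not is_numeric_dtype(column):
        # Remove surrounding spaces so " 14" parses and " " becomes empty.
        column = column.astype(str).str.strip()
    return pd.to_numeric(column, errors="coerce").astype("Int64")


converted = pd.DataFrame(
    {name: to_nullable_int(main[name]) for name in main.columns}
)

print(converted)
print(converted.dtypes.value_counts())
```

One caveat with this sketch: astype('Int64') raises an error if a column contains non-integer numbers such as 1.5, so such columns would need to stay as float.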


2022-09-30 10:43


