Is there a way to split the lists stored in one column into their elements and add a dummy-variable column for each element?

Asked 2 years ago, Updated 2 years ago, 55 views

I'm a beginner at data analysis.
I'm trying a simple competition for the first time.
I didn't know where to start, so I have been feeling my way forward one step at a time, but I couldn't work out how to search for an answer to this one.
As a result, I couldn't put the title into words well, and I think it is hard to understand. Sorry.

The competition is to predict prices for a private lodging service. Information about each property and price data are provided.
While examining the data, I found a column ('amenities') describing facilities and amenities.
This column holds its data in a list-like format, as shown below.

{TV, "Wireless Internet", Kitchen, "Free parking on premises", Washer, Dryer, "Smoke detector"}
{TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,"Free parking on premises",Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door","24-hour check-in",Hangers,Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}

I don't really expect the amenities to affect the price much, but I would like to check just in case, so I would like to do the following.

  1. From all the data (approximately 50,000 records), extract every distinct element that appears in 'amenities', with none missing and none duplicated, and create a new column for each element.
  2. For each newly created column, set 1 if the row's original 'amenities' value contains that element, or 0 if not.

I looked into whether pandas can do this kind of operation, but I couldn't find anything.
I think I could do step 1 by editing the CSV directly, but I can't think of a way to do step 2 mechanically.

I have two questions.

python pandas csv

2022-09-29 21:39

3 Answers

This is probably Airbnb data.

Each value in this column is separated by commas (,). Elements that contain two or more words, and therefore a space, such as Wireless Internet, are enclosed in double quotes ("). So the steps are:

  1. Remove the enclosing { and }.
  2. Remove the " around elements.
  3. Convert the comma-separated string into dummy variables.

Steps 1 and 2 amount to deleting three kinds of characters, {, } and ", which pandas can do with the .str.replace() method.
To dummy-encode string data containing a delimiter, as in step 3, pandas provides the .str.get_dummies() method.

The following is an example run. The method chain uses a regular expression to delete the three characters {, } and ", then dummy-encodes the result (data from this page was used).

In [1]: import pandas as pd

In [2]: df = pd.read_csv('lists.csv')
   ...: s = df['amenities']
   ...: s.head()
Out[2]:
0    {TV, "Cable TV", Internet, "Wireless Internet", "A...
1    {TV, Internet, "Wireless Internet", Kitchen, "Free...
2    {TV, "Cable TV", Internet, "Wireless Internet", "A...
3    {Internet, "Wireless Internet", Kitchen, "Indoor...
4    {TV, "Cable TV", Internet, "Wireless Internet", Ki...
Name: amenities, dtype: object

In [3]: s.str.replace(r'[{}"]', '', regex=True).str.get_dummies(',')
Out[3]:
      24-Hour Check-in  Air Conditioning  Breakfast  ...    Washer/Dryer  Wheelchair Accessible  Wireless Internet
0                    0                 1          0  ...               0                      0                  1
1                    0                 0          0  ...               0                      0                  1
2                    0                 1          0  ...               0                      0                  1
3                    0                 0          0  ...               0                      0                  1
4                    0                 0          0  ...               0                      0                  1
...                ...               ...        ...  ...             ...                    ...                ...
3813                 0                 1          0  ...               0                      1                  1
3814                 1                 0          0  ...               0                      0                  1
3815                 0                 0          1  ...               0                      0                  1
3816                 0                 0          0  ...               0                      0                  1
3817                 0                 0          0  ...               0                      0                  1

[3818 rows x 41 columns]
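
As a self-contained sketch of the same idea, the snippet below applies the method chain to two shortened rows in the question's format and joins the dummy columns back onto the original frame. The sample rows, the `price` values, and the `amenity_` prefix are made up for illustration and are not part of the real dataset. The sample has no spaces after the commas; data that does would need an extra whitespace strip.

```python
import pandas as pd

# Hypothetical sample rows in the question's format (not the real dataset).
df = pd.DataFrame({
    "price": [100, 150],
    "amenities": [
        '{TV,"Wireless Internet",Kitchen}',
        '{TV,"Cable TV",Heating}',
    ],
})

# Remove the braces and double quotes, then build one 0/1 column per element.
dummies = (
    df["amenities"]
    .str.replace(r'[{}"]', "", regex=True)
    .str.get_dummies(sep=",")
    .add_prefix("amenity_")  # assumed prefix, only to avoid column-name clashes
)

# Attach the dummy columns to the original rows.
result = df.drop(columns="amenities").join(dummies)
print(result)
```

Joining on the index works here because `dummies` keeps the same row index as `df`.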


2022-09-29 21:39

Preparation

Start with a Series made up of strings like the following:

>>> import pandas as pd
>>> amenity = pd.Series([
...     '{HOGE,"PIYO",HUGA}',
...     '{HOGE,"PIYO",HUGA}',
... ])
>>> amenity
0    {HOGE,"PIYO",HUGA}
1    {HOGE,"PIYO",HUGA}
dtype: object

Deleting the braces and quotes and then splitting on commas converts it into a Series of lists of strings:

>>> table = str.maketrans("", "", '{"}')
>>> amenity = amenity.str.translate(table).str.split(",")
>>> amenity
0    [HOGE, PIYO, HUGA]
1    [HOGE, PIYO, HUGA]
dtype: object

Problem Setup

Based on the preparation above, suppose you want to convert a data structure like this:

amenity = [
    ["A", "B"],
    ["B", "C"],
    ["A"],
    ["B", "C"],
]

into one like this:

   A  B  C
0  1  1  0
1  0  1  1
2  1  0  0
3  0  1  1

I usually do it as follows.

Python 3.8.11 (default, Aug 321, 06:49:12) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> pd.__version__
'1.3.1'
>>> df = pd.DataFrame({
...     "amenity": [
...         {"A", "B"},
...         {"B", "C"},
...         {"A"},
...         {"B", "C"},
...     ]
... })  # create sample data
>>> df
  amenity
0  {A, B}
1  {C, B}
2     {A}
3  {C, B}
>>> (df.amenity
...     .map(lambda s: {k: True for k in s})  # convert each set to a dict
...     .apply(pd.Series)                     # expand the elements into columns
...     .fillna(False))                       # fill missing values with False
       A      B      C
0   True   True  False
1  False   True   True
2   True  False  False
3  False   True   True

Note: line breaks and comments were added for readability, so the snippet may not run if you copy and paste it as is. Please adjust it as needed.
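
For reference, here is a minimal script-form sketch of the same approach that can be pasted into a file and run. It makes two substitutions that are not in the answer above: it uses `.notna()` in place of `.fillna(False)`, which yields the same True/False table because every value that is present is `True`, and it sorts the columns, since iteration order over a set is arbitrary. The variable name `flags` is just illustrative.

```python
import pandas as pd

# Sample data: each cell holds a Python set of labels.
df = pd.DataFrame({
    "amenity": [
        {"A", "B"},
        {"B", "C"},
        {"A"},
        {"B", "C"},
    ]
})

flags = (
    df["amenity"]
    .map(lambda s: {k: True for k in s})  # set -> dict of {label: True}
    .apply(pd.Series)                     # dict keys become columns
    .notna()                              # present -> True, missing -> False
    .sort_index(axis=1)                   # set order is arbitrary, so sort columns
)
print(flags)
```

Appending `.astype(int)` to the chain would give 1/0 instead of True/False.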

scikit-learn also provides a transformer that does this.
I am not sure whether pandas or scikit-learn is the more common choice for this, but I mention it for reference.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
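
A minimal sketch of how `MultiLabelBinarizer` could be applied to the list data from the problem setup, assuming scikit-learn is installed. The variable names `mlb` and `encoded` are only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Same sample as in the problem setup: one list of labels per row.
amenity = pd.Series([
    ["A", "B"],
    ["B", "C"],
    ["A"],
    ["B", "C"],
])

mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(amenity),  # 2-D array of 0/1 flags, one row per input list
    columns=mlb.classes_,        # labels found during fit, in sorted order
    index=amenity.index,         # keep the original row labels
)
print(encoded)
```

Because the fitted `mlb` remembers its classes, calling `mlb.transform` on other data later produces the same column layout.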

Searching for "pandas list column dummy" turns up this page:

https://stackoverflow.com/questions/29034928/pandas-convert-a-column-of-list-to-dummies

There, too, the proposed solutions are applying pd.Series to the column and using MultiLabelBinarizer.


2022-09-29 21:39

Use pandas.get_dummies.

import pandas as pd

amenities = pd.Series([
  '{TV, "Wireless Internet", Kitchen, "Free parking on premises", Washer, Dryer, "Smoke detector"}',
  '{TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,"Free parking on premises",Heating,Washer,Dryer,"Smoke detector","Carbon monoxide detector","First aid kit",Essentials,Shampoo,"Lock on bedroom door","24-hour check-in",Hangers,Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_49","translation missing: en.hosting_amenity_50"}',
])

# Strip the outer braces and quotes, split on commas (with optional quotes and
# spaces around them), stack into one element per row, then dummy-encode and
# sum the flags back up per original row.
category = pd.get_dummies(
  amenities.str.strip('{}"')
           .str.split(r'"?,\s*"?', expand=True)
           .stack()
).groupby(level=0).sum()

print(category.T.to_markdown())  # to_markdown() requires the tabulate package


2022-09-29 21:39



© 2024 OneMinuteCode. All rights reserved.