In a VCF file, the first few lines start with #, and when I read the file those lines are read as well. As a result, the Counter also counts the # lines, like {'3': 987, '7': 654, ..., '#~': 1, '#~': 1}.
Is there a way to remove these # entries?
Also, is it possible to list the counts in chromosome order (1, 2, 3, ...) instead of from most common to least common?
import sys
import os
from collections import Counter

count = []
with open('test.vcf', 'r') as file:
    lines = file.read().split('\n')
    for line in lines:
        a = line.split('\t')
        CHR = a[0]
        count.append(CHR)
c = Counter(count)
print(c)
The Counter class is a subclass of dict, so it can be handled like a dictionary, as described in the previous answer.
Specifically, you can list the keys with dict.keys() and then remove the unwanted ones with del.
A dictionary cannot be sorted in place, but you can sort its keys and then access the entries in whatever order you like.
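c_keys = sorted(c.keys())  # keys() returns a view with no .sort(); sorted() returns a new list
for k in c_keys:
    if k.startswith('#'):
        del c[k]  # drop the header/comment entries
    else:
        print(k, c[k])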
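Note that sorting the keys as strings gives '1', '10', '2', ... rather than 1, 2, 10. The following is a minimal sketch of one way to get numeric order while skipping the # lines at read time. The helper names count_chromosomes and chrom_sort_key and the sample rows are made up for illustration, and it assumes the chromosome names are plain numbers or labels such as X and Y.

```python
from collections import Counter


def count_chromosomes(lines):
    """Count the first tab-separated field of each non-header, non-empty line."""
    counts = Counter()
    for line in lines:
        line = line.rstrip('\n')
        if not line or line.startswith('#'):
            continue  # skip blank lines and VCF header/comment lines
        counts[line.split('\t', 1)[0]] += 1
    return counts


def chrom_sort_key(name):
    """Numeric names first, in numeric order; other names (X, Y, ...) after, alphabetically."""
    if name.isdecimal():
        return (0, int(name), '')
    return (1, 0, name)


# Hypothetical sample rows standing in for the contents of test.vcf
sample = [
    '##fileformat=VCFv4.2',
    '#CHROM\tPOS\tID\tREF\tALT',
    '10\t100\t.\tA\tG',
    '2\t200\t.\tC\tT',
    '2\t300\t.\tG\tA',
    'X\t400\t.\tT\tC',
    '1\t500\t.\tA\tC',
    '',
]

counts = count_chromosomes(sample)
for chrom in sorted(counts, key=chrom_sort_key):
    print(chrom, counts[chrom])
```

With a real file, the same function can be given the open file object instead of the sample list, for example inside with open('test.vcf') as f.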
You can do this with the standard library and pandas.
import sys
import os
import pandas as pd

# Read the file and drop comments (assuming '#' never appears in valid data)
lines = []
with open('test.vcf', 'r') as file:
    alltext = file.readlines()
for line in alltext:
    parts = line.strip().split('#')  # strip surrounding whitespace, then split on '#'
    if parts[0]:  # keep only the non-comment part
        lines.append(parts[0])

# Collect the first tab-separated field of each row
count = []
for line in lines:
    a = line.split('\t')
    count.append(a[0])

# Count how often each value appears and build a DataFrame
s = pd.Series(count)
vc = s.value_counts()  # pd.value_counts(s) is deprecated in recent pandas
df = vc.rename_axis('CHR').to_frame('counts')

# Get the length of the longest label
idx = list(df.index)
n = len(max(idx, key=len))

# Extract the numeric labels, right-align them, and sort
dfnum = df.query('CHR.str.isnumeric()', engine='python')
dfnum = dfnum.rename(index=lambda s: s.rjust(n))
dfnum = dfnum.sort_index()

# Extract the non-numeric labels, right-align them, and sort
dfstr = df.query('not CHR.str.isnumeric()', engine='python')
dfstr = dfstr.rename(index=lambda s: s.rjust(n))
dfstr = dfstr.sort_index()

# Concatenate the numeric and non-numeric parts
dfall = pd.concat([dfnum, dfstr])

# Display the result
print(dfall)
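The reason for right-aligning the labels with rjust before sorting may not be obvious: padding every label to the same width makes the plain string sort agree with numeric order, because a space sorts before any digit. Here is a small sketch of that idea using only built-in functions; the list of names is invented for illustration.

```python
# Hypothetical chromosome labels, in arbitrary order
names = ['10', '2', 'X', '1', 'Y', '21']

# Width of the longest label, as in the pandas code above
width = len(max(names, key=len))

# Right-align, then sort numeric and non-numeric labels separately
numeric = sorted(n.rjust(width) for n in names if n.isnumeric())
other = sorted(n.rjust(width) for n in names if not n.isnumeric())

# Strip the padding again for display
ordered = [n.strip() for n in numeric + other]

print(sorted(names))  # plain string sort puts '10' before '2'
print(ordered)        # padded sort gives numeric order, then X and Y
```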
After searching, it seems there are libraries for handling .vcf files, so using one of them might be a good option.
cyvcf2/pysam/PyVCF
cyvcf2: fast, flexible variant analysis with Python
It looks old, but there's also a summary page like this.
Python Basics and Bioinformatics Python 3.4
This is not directly related, but there is also a package collection for specialized fields such as this one.
Bioconda
Use bioconda to install NGS-related software in bulk
Tried Bioconda