
Parsing the KEGG database with Biopython

Editor: Python

KEGG, the Kyoto Encyclopedia of Genes and Genomes, is a comprehensive database that covers genes, pathways, and other data. To make KEGG data easier to query, KEGG provides an official API.

Biopython wraps the official KEGG API in its Bio.KEGG module, so the API can be used directly from Python. KEGG API requests correspond to Python calls as follows:

/list/hsa:10458+ece:Z5100 -> REST.kegg_list(["hsa:10458", "ece:Z5100"])

/find/compound/300-310/mol_weight -> REST.kegg_find("compound", "300-310", "mol_weight")

/get/hsa:10458+ece:Z5100/aaseq -> REST.kegg_get(["hsa:10458", "ece:Z5100"], "aaseq")
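To make the mapping concrete, here is a minimal sketch of how such a request path could be assembled from the arguments of a function-style call. The helper `kegg_url` and the constant `KEGG_BASE` are hypothetical names introduced for illustration and are not part of Biopython; the base URL is assumed to be the public KEGG REST endpoint. The sketch only builds the string and never contacts the server.

```python
# Assumption: base URL of the public KEGG REST service.
KEGG_BASE = "https://rest.kegg.jp"


def kegg_url(operation, *args):
    """Build a KEGG REST URL; list arguments are joined with '+'.

    Hypothetical helper for illustration only, not a Biopython function.
    """
    parts = [operation]
    for arg in args:
        if isinstance(arg, (list, tuple)):
            arg = "+".join(arg)  # several IDs travel in one path segment
        parts.append(str(arg))
    return KEGG_BASE + "/" + "/".join(parts)


print(kegg_url("list", ["hsa:10458", "ece:Z5100"]))
# https://rest.kegg.jp/list/hsa:10458+ece:Z5100
print(kegg_url("find", "compound", "300-310", "mol_weight"))
# https://rest.kegg.jp/find/compound/300-310/mol_weight
print(kegg_url("get", ["hsa:10458", "ece:Z5100"], "aaseq"))
# https://rest.kegg.jp/get/hsa:10458+ece:Z5100/aaseq
```

Each positional argument becomes one path segment, which is why a list of IDs on the Python side appears as a single `+`-joined segment in the request.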



The REST module can download any type of data the API supports. Taking a pathway as an example:

>>> from Bio.KEGG import REST

>>> pathway = REST.kegg_get('hsa00010')



The query returns a file-like handle; its read method returns the content as plain text. For example:

>>> pathway = REST.kegg_get('hsa00010')

>>> res = pathway.read().split("\n")

>>> res[0]

'ENTRY hsa00010 Pathway'

>>> res[1]

'NAME Glycolysis / Gluconeogenesis - Homo sapiens (human)'

>>> res[2]

'DESCRIPTION Glycolysis is the process of converting glucose into pyruvate and generating small amounts of ATP (energy) and NADH (reducing power). It is a central pathway that produces important precursor metabolites: six-carbon compounds of glucose-6P and fructose-6P and three-carbon compounds of glycerone-P, glyceraldehyde-3P, glycerate-3P, phosphoenolpyruvate, and pyruvate [MD:M00001]. Acetyl-CoA, another important precursor metabolite, is produced by oxidative decarboxylation of pyruvate [MD:M00307]. When the enzyme genes of this pathway are examined in completely sequenced genomes, the reaction steps of three-carbon compounds from glycerone-P to pyruvate form a conserved core module [MD:M00002], which is found in almost all organisms and which sometimes contains operon structures in bacterial genomes. Gluconeogenesis is a synthesis pathway of glucose from noncarbohydrate precursors. It is essentially a reversal of glycolysis with minor variations of alternative paths [MD:M00003].'
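Text like this can be grouped by section with plain string handling. The sketch below assumes the KEGG flat-file layout in which the section name occupies the first 12 columns and continuation lines leave those columns blank (the same assumption the gene-extraction code later in this article relies on). The function name `parse_flat_fields` is hypothetical, and the sample text is hand-typed for offline use: its ENTRY and NAME values come from the output above, while the GENE rows are made-up placeholders.

```python
def parse_flat_fields(text):
    """Group the lines of one KEGG flat-file entry by section name.

    Hypothetical helper; assumes section names sit in columns 1-12.
    """
    fields = {}
    current = None
    for line in text.rstrip().split("\n"):
        if line.startswith("///"):  # end-of-entry marker
            break
        key = line[:12].strip()
        if key:
            current = key  # a new section starts on this line
        if current is None:
            continue  # ignore anything before the first section
        fields.setdefault(current, []).append(line[12:].strip())
    return fields


# Hand-typed sample; the GENE rows are placeholders, not real KEGG data.
sample = "\n".join([
    "ENTRY".ljust(12) + "hsa00010" + " " * 20 + "Pathway",
    "NAME".ljust(12) + "Glycolysis / Gluconeogenesis - Homo sapiens (human)",
    "GENE".ljust(12) + "1111  GENE_A; placeholder description",
    " " * 12 + "2222  GENE_B; placeholder description",
    "///",
])

fields = parse_flat_fields(sample)
print(fields["ENTRY"][0].split()[0])  # hsa00010
print(fields["NAME"][0])  # Glycolysis / Gluconeogenesis - Homo sapiens (human)
print(len(fields["GENE"]))  # 2
```

With real data, the `sample` string would be replaced by the text returned from `REST.kegg_get(...).read()`.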



In this way, the string can be parsed to extract the pathway's ID, name, description, and so on.

Biopython also provides dedicated parsers for KEGG data, but they are incomplete: at present they cover only a few databases such as compound, map, and enzyme. Taking the enzyme database as an example, usage is as follows:

>>> from Bio.KEGG import REST

>>> from Bio.KEGG import Enzyme

>>> request = REST.kegg_get("ec:5.4.2.2")

>>> open("ec_5.4.2.2.txt", "w").write(request.read())

>>> records = Enzyme.parse(open("ec_5.4.2.2.txt"))

>>> record = list(records)[0]

>>> record


<Bio.KEGG.Enzyme.Record object at 0x02EE7D18>

>>> record.classname

['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']

>>> record.entry

'5.4.2.2'



With Biopython, we can not only call the KEGG API from Python but, more importantly, use Python's own logic to implement complex filtering. For example, to find human genes related to DNA repair, the basic idea is as follows:

1. Use the list API to get the IDs and descriptions of all human pathways;

2. Check each pathway's description and keep the pathways in which the keyword repair appears;

3. For each selected pathway, fetch the entry with the get API and extract its genes by parsing the text.

The complete code is as follows

>>> from Bio.KEGG import REST

>>> human_pathways = REST.kegg_list("pathway", "hsa").read()

>>> repair_pathways = []

>>> for line in human_pathways.rstrip().split("\n"):

...     entry, description = line.split("\t")

...     if "repair" in description:

...         repair_pathways.append(entry)

...

>>> repair_pathways

['path:hsa03410', 'path:hsa03420', 'path:hsa03430']

>>> repair_genes = []

>>> for pathway in repair_pathways:

...     # Query and read each pathway entry

...     pathway_file = REST.kegg_get(pathway).read()

...     current_section = None

...     for line in pathway_file.rstrip().split("\n"):

...         # Section names sit within the first 12 columns

...         section = line[:12].strip()

...         if section != "":

...             current_section = section

...         if current_section == "GENE":

...             gene_identifiers, gene_description = line[12:].split("; ")

...             gene_id, gene_symbol = gene_identifiers.split()

...             if gene_symbol not in repair_genes:

...                 repair_genes.append(gene_symbol)

...

>>> repair_genes

['OGG1', 'NTHL1', 'NEIL1', 'NEIL2', 'NEIL3', 'UNG', 'SMUG1', 'MUTYH', 'MPG', 'MBD4', 'TDG', 'APEX1', 'APEX2', 'POLB', 'POLL', 'HMGB1', 'XRCC1', 'PCNA', 'POLD1', 'POLD2', 'POLD3', 'POLD4', 'POLE', 'POLE2', 'POLE3', 'POLE4', 'LIG1', 'LIG3', 'PARP1', 'PARP2', 'PARP3', 'PARP4', 'FEN1', 'RBX1', 'CUL4B', 'CUL4A', 'DDB1', 'DDB2', 'XPC', 'RAD23B', 'RAD23A', 'CETN2', 'ERCC8', 'ERCC6', 'CDK7', 'MNAT1', 'CCNH', 'ERCC3', 'ERCC2', 'GTF2H5', 'GTF2H1', 'GTF2H2', 'GTF2H2C_2', 'GTF2H2C', 'GTF2H3', 'GTF2H4', 'ERCC5', 'BIVM-ERCC5', 'XPA', 'RPA1', 'RPA2', 'RPA3', 'RPA4', 'ERCC4', 'ERCC1', 'RFC1', 'RFC4', 'RFC2', 'RFC5', 'RFC3', 'SSBP1', 'PMS2', 'MLH1', 'MSH6', 'MSH2', 'MSH3', 'MLH3', 'EXO1']
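The two loops above can also be written as small functions that work on text that has already been downloaded, which makes them easy to try without network access. This is a sketch under stated assumptions, not the article's own method: the names `find_pathways` and `gene_symbols` are hypothetical, the keyword match is made case-insensitive, and gene lines that do not look like `<id> <symbol>; <description>` are skipped instead of raising `ValueError` as the tuple unpacking above would. The pathway IDs in the sample come from this article; the numeric gene IDs and gene descriptions are placeholders.

```python
def find_pathways(list_text, keyword):
    """Return pathway IDs whose description contains keyword (any case)."""
    hits = []
    for line in list_text.rstrip().split("\n"):
        entry, _, description = line.partition("\t")
        if keyword.lower() in description.lower():
            hits.append(entry)
    return hits


def gene_symbols(flat_text):
    """Collect unique gene symbols from the GENE section of one entry."""
    symbols = []
    current_section = None
    for line in flat_text.rstrip().split("\n"):
        section = line[:12].strip()  # section names use the first 12 columns
        if section:
            current_section = section
        if current_section != "GENE":
            continue
        body = line[12:]
        if "; " not in body:
            continue  # no "identifiers; description" separator on this line
        identifiers = body.split("; ", 1)[0].split()
        # Expect exactly "<gene_id> <symbol>"; skip anything else.
        if len(identifiers) == 2 and identifiers[1] not in symbols:
            symbols.append(identifiers[1])
    return symbols


# Hand-typed samples for offline use; gene IDs and descriptions are placeholders.
list_sample = (
    "path:hsa00010\tGlycolysis / Gluconeogenesis - Homo sapiens (human)\n"
    "path:hsa03410\tBase excision repair - Homo sapiens (human)\n"
    "path:hsa03430\tMismatch repair - Homo sapiens (human)\n"
)
flat_sample = "\n".join([
    "ENTRY".ljust(12) + "hsa03410",
    "GENE".ljust(12) + "1111  OGG1; placeholder description",
    " " * 12 + "2222  NTHL1; placeholder; second separator",
    " " * 12 + "1111  OGG1; placeholder description",
    "COMPOUND".ljust(12) + "C00001  placeholder",
    "///",
])

print(find_pathways(list_sample, "repair"))  # ['path:hsa03410', 'path:hsa03430']
print(gene_symbols(flat_sample))  # ['OGG1', 'NTHL1']
```

With live data, `list_sample` would be replaced by `REST.kegg_list("pathway", "hsa").read()` and `flat_sample` by `REST.kegg_get(pathway).read()` for each selected pathway.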



With Biopython, the KEGG API can be used more efficiently: combining the API's data retrieval with Python's logic makes it possible to meet custom analysis needs.
