您现在的位置：程式師世界 >> 編程語言 > >> 更多編程語言 >> Python

Python Trino fails to execute hive insert overwrite

編輯：Python

Use python Of trino Package execution insert overwrite, however overwrite But there is no effective problem

according to trino The official website of insert overwrite Of Opening mode , Turn on hive Of insert overwrite conversation , Make the current session insert into Statement support insert overwrite, That is, it supports the function of inserting data and updating according to partition coverage

But in use Python Code execution is never effective , It took a lot of time debug after , Finally, I found the reason. It was because I opened insert overwrite session The statement of does not execute fetchall() As a result, it did not take effect

The following is a code example of the execution

import trino
def insert_overwrite_query(trino, ...):
conn = trino.dbapi.connect()
cursor: Cursor = conn.cursor()
overwrite_sql = "SET SESSION hive.insert_existing_partitions_behavior = 'OVERWRITE'"
cursor.execute(overwrite_sql)
query = """insert into hive.schema.table ..."""
cursor.execute(query)
res = cursor.fetchall()
...

overwrite_sql After execution, you need to execute cursor.fetchall() Will take effect , I really don't understand this before sql Why fetchall()？fetchall() The function of is not executed only when data needs to be returned ？

After a search, I learned that fetchall() The real operating mechanism of

Usually the result set of a query is stored in memory , Good database connection processing generally stores the query result set in the memory of the server , Only when receiving the request will it be returned to the client , That is to say, only when fetchall() After the front insert overwrite session It will also be executed exactly

The other is python trino Medium fetchall() The role of the , After observing the source code , To determine the trino Query mechanism of

The query is performed in batches , Data with a large amount of data will be used first post Request to query a batch of data , Then next, if the client needs the rest of the results , Will use get The request ends after querying all the following data or the connection is interrupted

according to cursor.fetchall() Found that its data actually comes from a generator object TrinoResult, stay fetchall() In fact, it was executed list(TrinoResult(...)) The query results are obtained （ In fact, you do list(cursor.execute(query)) Will also get fetchall() Same result ）

class TrinoResult(object):
def __init__(self, query, rows=None, experimental_python_types: bool = False):
self._query = query
self._rows = rows or []
self._rownumber = 0
self._experimental_python_types = experimental_python_types
...
def __iter__(self):
# Initial fetch from the first POST request
for row in self._rows:
self._rownumber += 1
yield row
self._rows = None
# Subsequent fetches from GET requests until next_uri is empty.
while not self._query.finished:
rows = self._query.fetch()
for row in rows:
self._rownumber += 1
logger.debug("row %s", row)
if not self._experimental_python_types:
yield row
else:
yield self._map_to_python_types(row, self._query.columns)

All in all , stay session sql It needs to be executed once later fetchall() Will enable session Of sql take effect , But I always feel a little out of intuition