
How do I access files on a Hadoop file system that lives on a different server than my local machine?


I have a local machine (local_user@local_machine), and the Hadoop file system lives on a different server (some_user@another_server). One of the users on that Hadoop server is named target_user. How do I access target_user's files on some_user@another_server from local_user@local_machine? More precisely, say there is a file /user/target_user/test.txt in the HDFS on some_user@another_server. What is the correct file path to use when accessing /user/target_user/test.txt from local_user@local_machine?

I can access the file from within HDFS itself with hdfs dfs -cat /user/target_user/test.txt. But I cannot access it from my local machine with a Python script I wrote that reads from and writes to HDFS (it takes 3 arguments: a local file path, a remote file path, and read or write), most likely because I am not passing the correct path.
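(For reference, a script with that interface would typically look something like the sketch below. This is not my actual code; it just uses the `hdfs` PyPI package with a placeholder NameNode address and user to show where the HDFS path goes.)

# Sketch only: read/write over WebHDFS with the `hdfs` PyPI package.
# 'namenode-host:50070' and user='target_user' are placeholders.
import sys
from hdfs import InsecureClient

def main():
    # args: <local file path> <remote HDFS path> <read|write>
    local_path, hdfs_path, mode = sys.argv[1:4]
    # The client targets the NameNode's WebHDFS endpoint; the HDFS path itself
    # stays plain, e.g. /user/target_user/test.txt, with no user@host prefix.
    client = InsecureClient('http://namenode-host:50070', user='target_user')
    if mode == 'read':
        client.download(hdfs_path, local_path, overwrite=True)
    elif mode == 'write':
        client.upload(hdfs_path, local_path, overwrite=True)

if __name__ == '__main__':
    main()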

In any case, I tried the following invocations, but neither of them works:

$ #local_user@local_machine

$ python3 rw_hdfs.py ./to_local_test.txt /user/target_user/test.txt read

$ python3 rw_hdfs.py ./to_local_test.txt some_user@another_server/user/target_user/test.txt read

Both give exactly the same error:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 377, in _make_request
    httplib_response = conn.getresponse(buffering=True)
TypeError: getresponse() got an unexpected keyword argument 'buffering'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 560, in urlopen
    body=body, headers=headers)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 379, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.5/http/client.py", line 1197, in getresponse
    response.begin()
  File "/usr/lib/python3.5/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.5/http/client.py", line 279, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine: 


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 376, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 610, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 247, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 560, in urlopen
    body=body, headers=headers)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 379, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.5/http/client.py", line 1197, in getresponse
    response.begin()
  File "/usr/lib/python3.5/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.5/http/client.py", line 279, in _read_status
    raise BadStatusLine(line)
requests.packages.urllib3.exceptions.ProtocolError: ('Connection aborted.', BadStatusLine('\x15\x03\x03\x00\x02\x02\n',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "python_hdfs.py", line 63, in <module>
    status, name, nnaddress= check_node_status(node)
  File "python_hdfs.py", line 18, in check_node_status
    request = requests.get("%s/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus"%name,verify=False).json()
  File "/usr/lib/python3/dist-packages/requests/api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 426, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine('\x15\x03\x03\x00\x02\x02\n',))

"More precisely, say there is a file /user/target_user/test.txt in the HDFS on some_user@another_server"

First, HDFS is not a single directory on one machine, so it makes no sense to try to access it that way.

Second, whichever Python library you are using is trying to communicate over WebHDFS, which you have to enable for the cluster specifically.
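One way to check whether WebHDFS is enabled and reachable is to hit its REST API directly. This is only a sketch: 'namenode-host' and the port (50070 by default on Hadoop 2.x, 9870 on 3.x) are assumptions to replace with your cluster's NameNode web address.

# Sketch: probe WebHDFS with requests; host and port are placeholders.
import requests

BASE = 'http://namenode-host:50070/webhdfs/v1'
PATH = '/user/target_user/test.txt'

# Ask the NameNode for the file's metadata.
r = requests.get(BASE + PATH, params={'op': 'GETFILESTATUS', 'user.name': 'target_user'})
print(r.status_code, r.json())

# Stream the file content; requests follows the redirect to a DataNode.
r = requests.get(BASE + PATH, params={'op': 'OPEN', 'user.name': 'target_user'})
print(r.text)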

The BadStatusLine in the error may indicate that you are dealing with a Kerberized, secured cluster, so you may need a different way to read the file.
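If it really is a Kerberized cluster, an unsecured WebHDFS client will be rejected. A rough sketch using the `hdfs` package's Kerberos extension (this assumes `pip install hdfs[kerberos]`, a valid ticket obtained with kinit, and a placeholder HTTPS NameNode address):

# Sketch only: reading over secured WebHDFS with hdfs.ext.kerberos.
from hdfs.ext.kerberos import KerberosClient

# 'namenode-host:50470' is a placeholder for the secure NameNode web address.
client = KerberosClient('https://namenode-host:50470')
with client.read('/user/target_user/test.txt') as reader:
    print(reader.read().decode('utf-8'))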

For example, the PySpark or Ibis projects.
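For illustration, reading the same file with PySpark uses a full HDFS URI; 'namenode-host:8020' below is a placeholder for the cluster's fs.defaultFS, and a secured cluster still needs the appropriate Kerberos/Hadoop client configuration:

# Sketch: reading an HDFS file with PySpark; the NameNode address is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('read-hdfs-test').getOrCreate()
# Full URI: scheme + NameNode host:port + absolute HDFS path.
df = spark.read.text('hdfs://namenode-host:8020/user/target_user/test.txt')
df.show(truncate=False)
spark.stop()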