Df.loc impl by 1e-to · Pull Request #788 · IntelPython/sdc

1e-to · 2020-03-27T09:42:10Z

No description provided.

Rubtsowa · 2020-03-27T10:32:48Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+    --------
+    .. literalinclude:: ../../../examples/dataframe/dataframe_loc.py
+       :language: python
+       :lines: 34-


It would be better to use ':lines: 36-'.

Rubtsowa · 2020-03-27T10:34:53Z

sdc/tests/test_dataframe.py

+                           "B": [3, 4, 1, 0, 222],
+                           "C": [3.1, 8.4, 7.1, 3.2, 1]}, index=idx)
+        pd.testing.assert_series_equal(sdc_func(df), test_impl(df), check_names=False)
+


Add a test with a label that is not present in the DataFrame's index.

densmirn · 2020-03-27T10:30:46Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+    --------
+    .. literalinclude:: ../../../examples/dataframe/dataframe_loc.py
+       :language: python
+       :lines: 34-


Suggested change

-       :lines: 34-
+       :lines: 36-

densmirn · 2020-03-27T10:32:43Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+        raise TypingError('Operator getitem(). The index must be a single label, a  list or array of labels,\
+                          a slice object with labels, a boolean array or a callable. Given: {}'.format(idx))


Is getitem() correct here? Shouldn't the information in this message go into the Limitations block of the docstring?

What about this recommendation?

The message is needed so that users understand that they passed something incorrect.
The same is done for Series.

I meant:

Actually the operator is not getitem, it's loc.

The index must be a single label, a list or array of labels, a slice object with labels, a boolean array or a callable looks like a limitation, doesn't it?

How should I fix it, then?

Suggested change

-        raise TypingError('Operator getitem(). The index must be a single label, a  list or array of labels,\
-                          a slice object with labels, a boolean array or a callable. Given: {}'.format(idx))
+        ty_checker = TypeChecker('Operator loc().')
+        ty_checker.raise_exc(idx, 'int', 'idx')

I meant inserting a Limitations block into the docstring, as was done in 0e1ce3a#diff-37d3d013a811f054d85ea0713b88b1eeR1723-R1731. However, according to the code, idx can only be an integer.

I already did that. The Limitations block contains:
- Loc works with basic case only: single label

You still raise an exception with an incorrect message.

densmirn · 2020-03-27T10:34:57Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+                if self._dataframe._index[i] == idx:
+                data_0 = pandas.Series(self._dataframe._data[0], index=self._dataframe.index)
+                result_0 = data_0.at[idx]
+                data_1 = pandas.Series(self._dataframe._data[1], index=self._dataframe.index)
+                result_1 = data_1.at[idx]
+                return pandas.Series(data=[result_0[0], result_1[0]], index=['A', 'B'], name=str(idx))


Please fix the indentation.

densmirn · 2020-03-27T10:41:39Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+    func_lines = ['def _df_getitem_single_label_loc_impl(self, idx):',
+                  '  for i in numba.prange(len(self._dataframe.index)):',
+                  '    if self._dataframe._index[i] == idx:']
+    if isinstance(self.index, types.NoneType):
+        func_lines = ['def _df_getitem_single_label_loc_impl(self, idx):',
+                      '  if -1 < idx < len(self._dataframe._data):']


You will have incorrect indentation if index is None:

  if -1 < idx < len(self._dataframe._data):  # 2 white spaces
      data_0 =...  # 6 white spaces

densmirn · 2020-04-08T11:44:54Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+    space = '  '
+    if isinstance(self.index, types.NoneType):
+        func_lines = ['def _df_getitem_single_label_loc_impl(self, idx):',
+                      '  if -1 < idx < len(self._dataframe._data):']
+        space = ''
+    results = []
+    result_index = []
+    for i, c in enumerate(self.columns):
+        result_c = f"result_{i}"
+        func_lines += [f"{space}    data_{i} = pandas.Series(self._dataframe._data[{i}], index=self._dataframe.index)",
+                       f"{space}    {result_c} = data_{i}.at[idx]"]
+        results.append(result_c)
+        result_index.append(c)
+    data = '[0], '.join(col for col in results) + '[0]'
+    func_lines += [f"{space}    return pandas.Series(data=[{data}], index={result_index}, name=str(idx))",


Better to rename space to indent, something like this:

Suggested change

-    space = '  '
-    if isinstance(self.index, types.NoneType):
-        func_lines = ['def _df_getitem_single_label_loc_impl(self, idx):',
-                      '  if -1 < idx < len(self._dataframe._data):']
-        space = ''
-    results = []
-    result_index = []
-    for i, c in enumerate(self.columns):
-        result_c = f"result_{i}"
-        func_lines += [f"{space}    data_{i} = pandas.Series(self._dataframe._data[{i}], index=self._dataframe.index)",
-                       f"{space}    {result_c} = data_{i}.at[idx]"]
-        results.append(result_c)
-        result_index.append(c)
-    data = '[0], '.join(col for col in results) + '[0]'
-    func_lines += [f"{space}    return pandas.Series(data=[{data}], index={result_index}, name=str(idx))",
+    indent = ' ' * 6
+    if isinstance(self.index, types.NoneType):
+        func_lines = ['def _df_getitem_single_label_loc_impl(self, idx):',
+                      '  if -1 < idx < len(self._dataframe._data):']
+        indent = ' ' * 4
+    results = []
+    result_index = []
+    for i, c in enumerate(self.columns):
+        result_c = f"result_{i}"
+        func_lines += [f"{indent}data_{i} = pandas.Series(self._dataframe._data[{i}], index=self._dataframe.index)",
+                       f"{indent}{result_c} = data_{i}.at[idx]"]
+        results.append(result_c)
+        result_index.append(c)
+    data = '[0], '.join(col for col in results) + '[0]'
+    func_lines += [f"{indent}return pandas.Series(data=[{data}], index={result_index}, name=str(idx))",
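For readers unfamiliar with this style of code generation, here is a minimal standalone sketch of the idea behind the indent variable: derive the body indentation from the nesting depth of the header lines, then compile the generated text with exec. It uses plain Python lists instead of SDC dataframes, and the names build_row_getter_source, compile_row_getter and get_row are hypothetical, not part of SDC.

```python
def build_row_getter_source(column_names, has_index):
    """Return the source text of a function that picks one row from column lists."""
    lines = ['def get_row(data, index, idx):']
    if has_index:
        # Header nests two levels deep (for + if), so the body sits at depth 3.
        lines += ['    for i in range(len(index)):',
                  '        if index[i] == idx:']
        depth = 3
        position = 'i'
    else:
        # Header nests one level deep (if only), so the body sits at depth 2.
        lines += ['    if -1 < idx < len(data[0]):']
        depth = 2
        position = 'idx'
    indent = ' ' * (4 * depth)
    for n in range(len(column_names)):
        lines.append(f'{indent}result_{n} = data[{n}][{position}]')
    results = ', '.join(f'result_{n}' for n in range(len(column_names)))
    lines.append(f'{indent}return dict(zip({list(column_names)!r}, [{results}]))')
    # Reached only when no row matched.
    lines.append('    raise KeyError(idx)')
    return '\n'.join(lines)


def compile_row_getter(column_names, has_index):
    """Compile the generated source and return the resulting function."""
    namespace = {}
    exec(build_row_getter_source(column_names, has_index), namespace)
    return namespace['get_row']


columns = [[1, 2, 3], [4, 5, 6]]
by_label = compile_row_getter(['A', 'B'], has_index=True)
by_position = compile_row_getter(['A', 'B'], has_index=False)
print(by_label(columns, ['x', 'y', 'z'], 'y'))
print(by_position(columns, None, 2))
```

Computing the indent from the depth keeps the header and body lines consistent in both branches, which is the concern raised in the review.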

densmirn · 2020-04-08T11:49:42Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+    Limitations
+    -----------
+    - Parameter ``'name'`` in new DataFrame can be String only
+    - Loc works with basic case only: single label


Suggested change

-    - Loc works with basic case only: single label
+    - Parameter ``idx`` is supported only to be a single value, e.g. :obj:`df.loc['A']`.

densmirn · 2020-04-08T11:50:13Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+
+    Limitations
+    -----------
+    - Parameter ``'name'`` in new DataFrame can be String only


What does the parameter name mean?

In this case it means that the resulting Series (if the result is a Series) has a string name.
Maybe the limitation should be reworded to make it clearer.

What is the difference between Pandas and SDC regarding the name of the Series?

Don't we support a numeric name for the Series?

Yes, we support only string names for Series.
If the base Series has a numeric name, we convert it to a string.

pep8speaks · 2020-04-08T13:02:08Z

Hello @1e-to! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-20 16:17:24 UTC

AlexanderKalistratov · 2020-04-16T12:53:46Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+            data_0 = []
+            for i in numba.prange(len(idx_list)):
+                index_in_list_0 = idx_list[i]
+                data_0.append(self._dataframe._data[0][index_in_list_0])


You can't use append in a prange loop. Also, you could use sdc_take. @kozlov-alexey
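One append-free alternative is a count-then-fill scheme: count the matches per chunk, turn the counts into write offsets, then fill a preallocated result. The sketch below is plain Python with ordinary range loops and only illustrates the data flow; the function name find_positions_two_pass and the n_chunks parameter are hypothetical, and whether the PR's final code follows exactly this shape is an assumption.

```python
def find_positions_two_pass(index, label, n_chunks=4):
    """Return the positions where index[j] == label, without using append."""
    n = len(index)
    n_chunks = max(1, min(n_chunks, n))
    chunk_size = (n + n_chunks - 1) // n_chunks
    bounds = [(c * chunk_size, min((c + 1) * chunk_size, n))
              for c in range(n_chunks)]

    # Pass 1: count matches per chunk. Each chunk writes only its own slot,
    # so the iterations are independent of each other.
    counts = [0] * n_chunks
    for c, (start, stop) in enumerate(bounds):
        for j in range(start, stop):
            if index[j] == label:
                counts[c] += 1

    # Exclusive prefix sum: where each chunk starts writing in the result.
    offsets = [0] * n_chunks
    for c in range(1, n_chunks):
        offsets[c] = offsets[c - 1] + counts[c - 1]

    # Pass 2: fill the preallocated result. Chunks write to disjoint ranges.
    result = [0] * sum(counts)
    for c, (start, stop) in enumerate(bounds):
        pos = offsets[c]
        for j in range(start, stop):
            if index[j] == label:
                result[pos] = j
                pos += 1
    return result


print(find_positions_two_pass(['a', 'b', 'a', 'c', 'a'], 'a', n_chunks=2))
```

The cost of this scheme is that the index is scanned twice, in exchange for never growing a shared container inside the loop.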

AlexanderKalistratov · 2020-04-16T12:55:41Z

sdc/datatypes/hpat_pandas_dataframe_functions.py

+        def _df_getitem_single_label_loc_impl(self, idx):
+            idx_list = []
+            for i in range(len(self._dataframe.index)):
+                if self._dataframe._index[i] == idx:


What would happen if _index is None?

Also, it would be better to do this in parallel: split the index into chunks, create a list per chunk, and then merge them.

What would happen if _index is None?
That is fine, because I don't use dataframe._index when index is None.

AlexanderKalistratov · 2020-04-16T12:56:05Z

@1e-to conflict

AlexanderKalistratov · 2020-04-17T19:29:16Z

sdc/functions/numpy_like.py

+                if arr[j] == idx:
+                    res += 1
+            length += res
+            arr_len[i] = res


You could allocate a list of lists, where the length of the outer list equals the number of chunks. Then you can safely use append on the list that belongs to each chunk, so a single loop would be enough (plus another loop, probably not parallel, to merge all the lists into one).
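A minimal sketch of this list-of-lists suggestion, in plain Python so it runs without Numba. The function name find_positions_chunked and the n_chunks parameter are hypothetical; in a Numba version the outer loop over chunks would presumably be the prange loop, which is not shown here.

```python
def find_positions_chunked(index, label, n_chunks=4):
    """Return the positions where index[j] == label, using one list per chunk."""
    n = len(index)
    n_chunks = max(1, min(n_chunks, n))
    chunk_size = (n + n_chunks - 1) // n_chunks

    # One private list per chunk: an iteration only appends to its own list.
    per_chunk = [[] for _ in range(n_chunks)]
    for c in range(n_chunks):
        start = c * chunk_size
        stop = min(start + chunk_size, n)
        for j in range(start, stop):
            if index[j] == label:
                per_chunk[c].append(j)

    # Sequential merge; chunk order preserves the original position order.
    merged = []
    for chunk in per_chunk:
        merged.extend(chunk)
    return merged


print(find_positions_chunked(['a', 'b', 'a', 'c', 'a'], 'a', n_chunks=2))
```

Compared with counting matches first, this needs only one scan of the index, at the price of an extra merge step.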

Df.loc impl

ce783ae

1e-to added the Ready for Review label Mar 27, 2020

1e-to requested a review from densmirn March 27, 2020 09:42

Rubtsowa reviewed Mar 27, 2020

View reviewed changes

densmirn suggested changes Mar 27, 2020

View reviewed changes

densmirn added Waiting on author and removed Ready for Review labels Mar 27, 2020

small fixes

5715fc0

1e-to added Ready for Review and removed Waiting on author labels Apr 7, 2020

densmirn added Waiting on author and removed Ready for Review labels Apr 8, 2020

fix

e6b3ab1

densmirn reviewed Apr 8, 2020

View reviewed changes

fix

05de3d2

pep

2e3e8da

1e-to added Ready for Review and removed Waiting on author labels Apr 10, 2020

densmirn approved these changes Apr 10, 2020

View reviewed changes

etotmeni added 3 commits April 14, 2020 15:06

add attr

9648125

add case of return dataframe

721f060

unify return values

d861e3f

AlexanderKalistratov reviewed Apr 16, 2020

View reviewed changes

1e-to and others added 3 commits April 16, 2020 21:17

Merge branch 'master' into dfloc

1623b99

fix with sdc_take

543d4ed

add find idx with chunks

28d6b38

AlexanderKalistratov reviewed Apr 17, 2020

View reviewed changes

Add support list of lists in sdc_take

723b6ed

pep

5477a98

AlexanderKalistratov merged commit 669443c into IntelPython:master Apr 21, 2020
